Foreword

Among the Python libraries, Numpy, Pandas, Matplotlib, Scipy, and scikit-learn are important libraries and modules commonly used for data analysis, data science, and machine learning.

In fact, Pandas is not a tool for data science, but a tool for the pre-analytical stage of data science. This article will mainly introduce the basic use of important data structures Series and DataFrame in Pandas.

Knowledge of Pandas

Pandas is a very useful library for data preprocessing and analysis. It combines the characteristics of Numpy and has data manipulation capabilities similar to Excel and SQL, allowing us to perform various flexible processing of various data in the form of DataFrame.

The main features of Pandas are as follows:

  • Provides two main data structures: Series, DataFrame. Series is mainly used to create a one-dimensional array of indexes to process data related to time series; while DataFrame is a two-dimensional data set with column indexes and column labels, similar to Excel’s data table or SQL’s relational database .

  • Through the data structure of Pandas and the method of structuring objects, the data is processed in a more diverse manner. Such as filling, deleting or replacing data with null values.

  • Easier to read, convert and process heterogeneous data. For example, to find out a column or row that meets a certain condition from the data, or to separate and combine the data, etc.

  • More diverse input sources and integrated output methods. For example, you can read data from CSV or database into DataFrame, and you can also convert processed data into CSV or database.

Through Pandas, we can overcome the limitations of Excel, and also make the work of data analysis more convenient and easier, and can more quickly discover the information in the data and its meaning.

Install

pip3 install pandas

Import

import pandas as pd

Create Series

Series are objects similar to one-dimensional arrays. We can use Python’s list data type or dictionary data type to create a Series:

1. Use list to create Series

list_1 = [10, 20, 30, 40, 50, 60]
data = pd.Series(list_1)
print(data)
# output
0    10
1    20
2    30
3    40
4    50
5    60
dtype: int64

The left part is index and the right part is value . In addition to specifying the value of the data, we can also specify the index value of the data through the index parameter:

list_1 = [10, 20, 30, 40, 50, 60]
data = pd.Series(list_1,index=['a','b','c','e','d','f'])
print(data)
# output
a    10
b    20
c    30
e    40
d    50
f    60
dtype: int64

2. Use dictionary to create Series

dict1 = {'a':10, 'b':20, 'c':30, 'd':40, 'e':50}
data = pd.Series(dict1)
print(data)
# output
a    10
b    20
c    30
d    40
e    50
dtype: int64

For the index and value of the data, we can use the index attribute and the values attribute to retrieve them respectively:

dict1 = {'a':10, 'b':20, 'c':30, 'd':40, 'e':50}
data = pd.Series(dict1)
print(data.index)
print(data.values)
# output
# print(data.index)
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
# print(data.values)
[10 20 30 40 50]

It is worth noting that the custom index value of Series does not apply to the data type of dict, because the dict itself has a Key to specify the index value of the Series, and the value of the custom index will become NaN (no data).

dict1 = {'a':10, 'b':20, 'c':30, 'd':40, 'e':50}
data = pd.Series(dict1,index=['f', 'g', 'h', 'i', 'j'])
print(data)
# output
f   NaN
g   NaN
h   NaN
i   NaN
j   NaN
dtype: float64

But we can change the order of the data by changing the order of the index:

dict1 = {'a':10, 'b':20, 'c':30, 'd':40, 'e':50}
data = pd.Series(dict1,index=['b', 'a', 'd', 'e', 'c'])
print(data)
# output
b    20
a    10
d    40
e    50
c    30
dtype: int64

Create DataFrame

DataFrame is a two-dimensional data object with index and column, which is the most important data structure in Pandas. Basically, when we use Pandas for data manipulation and analysis, we mostly use the data structure of DataFrame.

DataFrame can set different dtypes for each column, such as int, str, bool, etc. In addition to using dictionary and ndarray to create DataFrame, we can also create it by reading external data (such as: CSV, SQL, JOSN, HTML).

1. Use dictionary to create DataFrame

To create a DataFrame with a dictionary, just convert the data in the dict type into a DataFrame:

dict_student = {'ID':[1100101, 1100102, 1100103, 1100104, 1100105],'Sex':['f', 'm', 'f', 'f', 'm'],'Chinese':[60, 70, 77, 69, 70], 'Math':[66, 75, 74, 88, 94]}
df = pd.DataFrame(dict_student)
print(df)
# output
        ID Sex  Chinese  Math
0  1100101   f       60    66
1  1100102   m       70    75
2  1100103   f       77    74
3  1100104   f       69    88
4  1100105   m       70    94

2. Use ndarray to create DataFrame

We can also use a simple array to create a DataFrame, and set the column names through the columns parameter:

student =(np.array([[1100101, 'f', 60, 66], [1100102, 'm', 70, 75], [1100103, 'f', 77, 74], [1100104, 'f', 69, 88],[1100105, 'm', 70, 94]]))
df_student = pd.DataFrame(student,columns=['ID', 'Sex', 'Chinese', 'Math'])
print(df_student)
# output
        ID Sex Chinese Math
0  1100101   f      60   66
1  1100102   m      70   75
2  1100103   f      77   74
3  1100104   f      69   88
4  1100105   m      70   94

3. Read external data to create DataFrame

As mentioned earlier, we can read external data, such as CSV, SQL, JOSN, HTML, to convert it into a DataFrame data structure. For example, if we want to load a CSV file, we can write it like this:

df = pd.read_csv('檔名')
df = pd.read_html('檔名')

4. Index of custom data

The DataFrame is the same as the Series, and the index value of the data can also be customized through the index parameter:

dict_student = {'ID':[1100101, 1100102, 1100103, 1100104, 1100105], 'Sex':['f', 'm', 'f', 'f', 'm'], 'Chinese':[60, 70, 77, 69, 70], 'Math':[66, 75, 74, 88, 94]}
df_student_index = pd.DataFrame(dict_student,index=['a', 'b', 'c', 'd', 'e'])
print(df_student_index)
# output
        ID Sex  Chinese  Math
a  1100101   f       60    66
b  1100102   m       70    75
c  1100103   f       77    74
d  1100104   f       69    88
e  1100105   m       70    94

It should be noted that in the index value of the custom DataFrame, data is applicable to data that has not been converted into a DataFrame, such as the above example data = dict_student or the above example data = student :

student =([1100101, 'f', 60, 66], [1100102, 'm', 70, 75], [ 1100103, 'f', 77, 74], [1100104, 'f', 69, 88], [1100105, 'm', 70, 94])
df_student = pd.DataFrame(student,columns=['ID','Sex','Chinese','Math'],index=['a', 'b', 'c', 'd', 'e'])
print(df_student)
# output
        ID Sex  Chinese  Math
a  1100101   f       60    66
b  1100102   m       70    75
c  1100103   f       77    74
d  1100104   f       69    88
e  1100105   m       70    94

In the process of converting to DataFrame, the value of index has been assigned, as follows:

# convert to df
dict_student = {'ID':[1100101, 1100102, 1100103, 1100104, 1100105], 'Sex':['f', 'm', 'f', 'f', 'm'], 'Chinese':[60, 70, 77, 69, 70], 'Math':[66, 75, 74, 88, 94]}
df = pd.DataFrame(dict_student)
print(df)
        ID Sex  Chinese  Math
0  1100101   f       60    66
1  1100102   m       70    75
2  1100103   f       77    74
3  1100104   f       69    88
4  1100105   m       70    94 

When the value of index is customized, the original data will become NaN (no data), which is similar to the situation where the index value of the Series dict needs to be determined.

# Add a custom index to the data converted to df
df_student_index = pd.DataFrame(df,index=['a', 'b', 'c', 'd', 'e'])
print(df_student_index)
   ID  Sex  Chinese  Math
a NaN  NaN      NaN   NaN
b NaN  NaN      NaN   NaN
c NaN  NaN      NaN   NaN
d NaN  NaN      NaN   NaN
e NaN  NaN      NaN   NaN

Let’s actually verify what it looks like when the data is converted to a DataFrame and then converted back to a dictionary:

dict_student = {'ID':[1100101, 1100102, 1100103, 1100104, 1100105], 'Sex':['f', 'm', 'f', 'f', 'm'], 'Chinese':[60, 70, 77, 69, 70], 'Math':[66, 75, 74, 88, 94]}
df = pd.DataFrame(dict_student)
print(df.to_dict())
{'ID': {0: 1100101, 1: 1100102, 2: 1100103, 3: 1100104, 4: 1100105}, 'Sex': {0: 'f', 1: 'm', 2: 'f', 3: 'f', 4: 'm'}, 'Chinese': {0: 60, 1: 70, 2: 77, 3: 69, 4: 70}, 'Math': {0: 66, 1: 75, 2: 74, 3: 88, 4: 94}} 

We can see that the data that has been converted into a DataFrame has indeed been assigned an index value.

5. Change column order, add new column

In the process of data analysis, we may need to change the order of columns or add new columns. These requirements can also be helped us through the column parameter.

If you want to change the column order, you can write it like this:

dict_student = {'ID':[1100101, 1100102, 1100103, 1100104, 1100105], 'Sex':['f', 'm', 'f', 'f', 'm'], 'Chinese':[60, 70, 77, 69, 70], 'Math':[66, 75, 74, 88, 94]}
df = pd.DataFrame(dict_student)
print(df)

# method 1
df_columns1 = pd.DataFrame(df,columns=['ID', 'Sex', 'Math', 'Chinese'])
# method 2
df_columns2 = pd.DataFrame(dict_student,columns=['ID', 'Sex', 'Math', 'Chinese'])
print(df_columns2)
# output
# original data order
        ID Sex  Chinese  Math
0  1100101   f       60    66
1  1100102   m       70    75
2  1100103   f       77    74
3  1100104   f       69    88
4  1100105   m       70    94
# Changed data order
        ID Sex  Math  Chinese
0  1100101   f    66       60
1  1100102   m    75       70
2  1100103   f    74       77
3  1100104   f    88       69
4  1100105   m    94       70

If you want to add a new column, you can write it like this:

# method 1
df_columns_add1 = pd.DataFrame(df,columns=['ID', 'Sex', 'Math', 'Chinese', 'English']
# method 2
df_columns_add2 = pd.DataFrame(dict_student,columns=['ID', 'Sex', 'Math', 'Chinese', 'English'])
print(df_columns_add2)
# output 
        ID Sex  Math  Chinese  English
0  1100101   f    66       60      NaN
1  1100102   m    75       70      NaN
2  1100103   f    74       77      NaN
3  1100104   f    88       69      NaN
4  1100105   m    94       70      NaN

You can see that an English column has been successfully added, and then the English value is given through the list:

list_English = [70,88,67,89,97]
df_columns_add2['English'] = list_English
print(df_columns_add2)
# output 
        ID Sex  Math  Chinese  English
0  1100101   f    66       60       70
1  1100102   m    75       70       88
2  1100103   f    74       77       67
3  1100104   f    88       69       89
4  1100105   m    94       70       97

DataFrame data information

After we have created the DataFrame, we can use the methods or properties in the DataFrame data structure to view the information of the data. The more common methods are: .shape, .head(), .tail(), .describe(), .index, .columns . Let’s take df_student below as an example, let’s take a look at these methods in practice.

dict_student = {'ID':[1100101, 1100102, 1100103, 1100104, 1100105], 'Sex':['f', 'm', 'f', 'f', 'm'], 'Chinese':[60, 70, 77, 69, 70], 'Math':[66, 75, 74, 88, 94], 'English':[70, 88, 67, 89, 97]}
df_student = pd.DataFrame(dict_student)
print(df_score)
# print(df_student) 
        ID Sex  Math  Chinese  English
0  1100101   f    66       60       70
1  1100102   m    75       70       88
2  1100103   f    74       77       67
3  1100104   f    88       69       89
4  1100105   m    94       70       97

1. View data row, column

When we want to see how many rows and columns df_student has, we can use .shape, and the result is (row, column):

# .shape
print(df_student.shape)
# output
(5, 5)

2. View the value of the previous n data

When we want to view the values of the previous records of df_student, we can use .head(). The default is to display 5 records. We can also specify the number of records to be displayed by substituting a value in .head(), such as viewing the first 2 records:

# .head()
print(df_student.head(2))
# output
        ID Sex  Math  Chinese  English
0  1100101   f    66       60       70
1  1100102   m    75       70       88

3. View the value of the last n records

When we want to see the value of the last few records of df_student, we can use .tail(). The default is also to display 5 records. We can specify the number of records to be displayed by substituting a value in .tail(), for example, to view the last 2 records:

# .tail()
print(df_student.tail(2))
# 輸出
        ID Sex  Math  Chinese  English
3  1100104   f    88       69       89
4  1100105   m    94       70       97

*It should be noted that both .head() and .tail() can only display up to 5 pieces of data

4. View descriptive statistics for a profile

When we want to view the descriptive statistics of df_student, we can use .describe(), through which we can more easily understand the summary of the entire data:

# .describe()
print(df_student.describe())
# output 
                 ID       Math    Chinese    English
count  5.000000e+00   5.000000   5.000000   5.000000
mean   1.100103e+06  79.400000  69.200000  82.200000
std    1.581139e+00  11.349009   6.058052  13.026895
min    1.100101e+06  66.000000  60.000000  67.000000
25%    1.100102e+06  74.000000  69.000000  70.000000
50%    1.100103e+06  75.000000  70.000000  88.000000
75%    1.100104e+06  88.000000  70.000000  89.000000
max    1.100105e+06  94.000000  77.000000  97.000000

5. View the index of the profile

When we want to see the index of df_student, we can use the .index property:

# .index
print(df_student.index)
# 輸出
RangeIndex(start=0, stop=5, step=1)

6. View all column names of the data

When we want to see all the column names in df_student, we can use the .columns property:

# .columns
print(df_student.columns)
# output
Index(['ID', 'Sex', 'Math', 'Chinese', 'English'], dtype='object')

7. View the content of the profile

To view the data content of df_student, we can use the properties of .info:

# .info
print(df_student.info)
# output
<bound method DataFrame.info of ID Sex  Math  Chinese  English
0  1100101   f    66       60       70
1  1100102   m    75       70       88
2  1100103   f    74       77       67
3  1100104   f    88       69       89
4  1100105   m    94       70       97>

8. Convert row to column

In addition to these methods for viewing data information, DataFrame also provides operations like matrix transpose, allowing the .T methods to swap rows and columns:

# .T
print(df_student.T)
# output
               0        1        2        3        4
ID       1100101  1100102  1100103  1100104  1100105
Sex            f        m        f        f        m
Math          66       75       74       88       94
Chinese       60       70       77       69       70
English       70       88       67       89       97

When we do data processing before, we often use these methods to help us better understand the content and overview of the entire data, and then perform data processing and processing, such as: data extraction, data deletion, merging or Sorting, etc., and even more advanced operations such as handling of nulls and outliers.

Then, due to the limited space of this article, in the next article, let’s learn about Pandas’ data preprocessing and how to operate DataFrame!

After reading this, do you know more about Pandas?~