Hi thereππ»
I hope you doing well.
Today, I would like to explain the basic knowledge of pandas based on the official tutorial.
What is pandas ?
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Basic data structures in pandas
Series
: a one-dimensional labeled array holding data of any type such as integers, strings, Python objects etc.DataFrame
: a two-dimensional data structure that holds data like a two-dimension array or a table with rows and columns.
Read and write data
Read data from file
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv', index_col=False)
import pandas
: Make pandas availableas pd
: The part to write “pandas” can be written by omitting “pd”pd.read_csv('name of file')
:pd.read_csv
is a pandas function and is used to read CSV files.index_col=False
: If there is no index column in the read file, add index_col=False. Then the index will be automatically added on loading.
Read data by direct description
import pandas as pd
import numpy as np
series = pd.Series([1,2,3])
df = pd.DataFrame([['sato', 18], ['suzuki', 20]])
pd.Series
: a pandas function to convert a list or array into a one-dimensional labeled array.pd.Dataframe
: a pandas function to convert lists, arrays, dictionaries, etc. into two-dimensional labeled tables.
Write data to CSV file
data.to_csv('sample.csv', index=False)
to_csv
: a pandas function to convert the data variable to a csv file and save it.index=False
: index is not needed for data analysiss, so it is not saved.
Viewing data
a concise summary of a DataFrame
data.info()
It provides the following information:
- Index (RangeIndex): Shows the index range (starting and ending) of the DataFrame.
- Column names: Lists all the column names in the DataFrame.
- Non-null count: Displays the number of non-null (non-missing) values in each column.
- Data type (Dtype): Shows the data type of each column (e.g., int64, float64, object, datetime64, etc)
show the type of the object
type(data)
When working with Pandas, if data is a DataFrame or a Series, type(data) will return:
- pandas.core.frame.DataFrame if data is a DataFrame.
- pandas.core.series.Series if data is a Series.
Change the data type for each column
data['name of column'].astype(float)
#Change data type of 'name of column' to float
Display the number of matrices
data(shape)
#Example result (5,5)
#(number of rows, columns) can be obtained
Display the number of rows
len(data)
Dispaly the number of columns
len(data.columns)
Display total number of elements
data.size
#Equal to number of rows * number of columns
Display the first row
data.head()
#Display 5 lines of data from the top
data.head(2)
#Display 2 lines of data from the top
Dispaly the last row
data.tail()
#Display 5 lines of data from the end
data.tail(2)
#Display 2 lines of data from the end
Dispaly optional rows
data[10:20]
#Display data from rows 10 to 19
data[:5]
#Display 5 lines of data from the top
data[-5:]
#Display 5 lines of data from the end
data.loc['name of row', 'name of column']
#Display data for specified row/column name
data,loc['name of row', 'name of column'] = 'Data you want to insert'
#Data can be inserted into specified rows and columns
data.iloc[0]
#Display data from row with index 0 (first row)
data.iloc[-1]
#DIsplay the data of the last row
data[data['name of column'] >= 100]
#Display rows with a value of 100 or more in the specified column
data[data['name of column'].max()]
#Display rows with the largest value in the specified column
Display optional columns
data[['name of column']]
#Dispaly data for specified column name
data[['name of column 1', 'name of column 2']]
#Display data in multiple specified columns
Display optional value
data.iloc[0]['name of column']
#Displays the value of the 'column name' element in the first row
Counting unique elemants
data.unique()
#Display list of unique element values
data.value_counts()
#Displays unique element values and their number of appearances
data.nunique()
#Display the number of unnique elements
data.nunique(axis=1)
#Display the number of unique elements for each row
axis=0
: Default value. Vertical, in other words, an operation on each column.axis=1
: Operate horizontally, in other words, for each row.
Data totalization
Data connection
data = pd.concat((data, add_data), axis=0, sort=True)
#Concatenate data and add_data
data = data.reset_index()
#Reassign the index. The original index is stored in the column `index`
data.drop(['index'], axis=1, inplace=True)
#Delete the `index` column created above
sort=True
: Specifies whether the concatenated columns should be sorted in dictionary order.how='inner'
: Adding this option when concatenating results in an inner join and duplicate columns/rows are deleted..drop
: Deletes rows and columns with specified labels.inplace=True
: Manipulates the current data directly; if False, new data is created.
Caluculaion
data.sum()
#Displays the sum of each column of data
data.sum(axis=1)
#Displays the sum of each row of data
data.mean(axis=1)
#Displays the mean of each row of data
data.describe().round(1)
#Calculate summary statistics for each column of data to one decimal place
data.corr()
#Calculate the correlation coefficient for each column of data
Data formatting
Duplicate/Unique value
data['name of column'].unique()
#Display unique values for specified columns
data[~data['name of column'].duplicated()]
#Extract only unique values for a given column
data.drop_duplicates('name of column', inplace=True)
#Delete rows with duplicate values in the specified column
.duplicated()
: A method that returns a boolean value as to whether the element is a duplicate or not.~
: Tilde. A method to invert boolean values..drop_duplicates()
: Method to delete duplicate rows.
Missing value
data.isnull()
#Checks all values and returns a boolean value depending on the existence of missing values
data.isnull().all()
#Determine if there are missing values in each column. Returns False if there is even one missing value in a column
data.isnull().all(axis=1)
#Determine if there are missing values in each row. Returns False if there is even one missing value in a row
data.isnull().any()
#Display columns with at least one missing value
data.isnull().any(axis=1)
#Display rows with at least one missing value
data.dropna(how='any')
#Rows that contain even one missing value are deleted
data.dropna(how='all')
#Rows that are all missing values are deleted
data.dropna(how='all', axis=1)
#Columns that are all missing values are deleted
.isnull()
, .isna()
: Determine for each element and display True if there is a missing value..notnull()
, .notna()
: Determine for each element and display True if the value exists.
String processing
Data processing for optional strings
data.str.startswith('text string')
#Returns True for elements beginning with 'text string'
data[data.str.startswith('text string')]
#Display rows including elements beginning with 'text string'
data.str.endswith('text string')
#Returns True for elements ending in 'text string'.
data.str.containswith('text tring')
#Returns True for elements including 'text string'.
data['name of column'].str[0:2]
#Slices the string contents of a column with the specified column name to the first two characters of that string only.
data = data.str.strip()
#Delete leading and trailing blank characters in data.
data = data.str.strip('x')
#Delete 'x' in data.
data['name of column'][index] = data['name of column'][index].strip('x')
#Deletes the string 'x' from the value of the specified column name * index.
I hope you will make use of this article in your studies!
Thanks for reading this far!
See you in the next post!