Python Pandas tutorial (beginner's guide to data analysis)
Pandas is an open-source, high-performance Python library used for data analysis, machine learning, and feature engineering. It is one of the critical tools in a data scientist's arsenal, offering powerful facilities for data analysis and manipulation.
- 1 Python Pandas Installation
- 2 Pandas Data Structures
- 3 Import CSV Files
- 4 Import Excel Files
- 5 Import JSON Files
- 6 Import SQL Files
- 7 Import HTML
- 8 Import XML Files
- 9 Import Parquet Files
- 10 Selecting and Indexing
- 11 Export DataFrame
- 12 Pickling & Unpickling
- 13 Handling Other File Formats
- 14 DataFrame Filtering
- 15 Data Cleaning
- 16 Aggregating Data
- 17 Grouping Data
- 18 Pivot Tables
- 19 Correlations
- 20 Descriptive Statistics
- 21 Merging DataFrames
- 22 Concatenating DataFrames
- 23 Reshaping DataFrames
- 24 Column Operations
- 25 Date and Time Handling
- 26 Data Visualization
- 27 Pandas comes to the rescue
Python Pandas Installation
Open your command line (terminal for Mac/Linux, command prompt for Windows) and type the following command:
pip install pandas
If you are using a Python environment like Anaconda, you can use the conda package manager to install pandas:
conda install pandas
After you’ve installed it, you can import pandas in your Python script like so:
import pandas as pd
It's customary to import pandas under the alias pd for brevity.
Pandas Data Structures
Pandas provides two main data structures – the DataFrame and the Series. Both are built on top of NumPy, another powerful Python package, making data manipulation in pandas flexible and efficient.
Let’s take a quick look at these data structures:
DataFrames
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used pandas object.
import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)
Output:
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
The first row of numbers (0 to 3) is the index of the DataFrame, and it was automatically generated.
Series
A Series is a one-dimensional labeled array that can hold any data type. It’s essentially a single column of a DataFrame, and the axis labels are collectively referred to as the index.
Here is how you can create a pandas Series from a Python list:
import pandas as pd
import numpy as np

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
Output:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
We just created a pandas Series from a list, which can contain any data type. Note that np.nan was used to represent a missing value.
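A Series can also be built from a dictionary or a tuple. A quick sketch (the keys 'a', 'b', 'c' are just example labels):

```python
import pandas as pd

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
print(s_dict)

# From a tuple: behaves like a list, with a default integer index
s_tuple = pd.Series((1.5, 2.5, 3.5))
print(s_tuple)
```

When built from a dictionary, the Series preserves the insertion order of the keys.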
Import CSV Files
Assuming we have a CSV file named ‘sample.csv’ in the same directory as our Python script, we can import this data as follows:
import pandas as pd

df = pd.read_csv('sample.csv')
print(df.head())
Assume the output as follows:
   ID   Age Gender    Occupation  Salary  Experience
0   1  35.0      M      Engineer   50000        10.0
1   2  42.0      F     Scientist   65000        12.0
2   3  30.0      M     Astronaut   82000         9.0
3   4  25.0      F  Data Analyst   62000         5.0
4   5   NaN      M    Technician   48000         NaN
The read_csv function reads a CSV file and converts it into a DataFrame. The head() method retrieves the first five rows of the DataFrame, which is useful for quickly checking whether your object contains the right kind of data.
The sample data includes columns for ID, Age, Gender, Occupation, Salary, and Experience.
Note that there are some missing values in the Age and Experience columns and some duplicates which we will handle in the upcoming sections.
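read_csv also accepts many optional parameters to control parsing. A small sketch using an in-memory buffer instead of a file on disk (the column names and the extra 'NA' marker are made up for illustration):

```python
import io
import pandas as pd

csv_text = "ID,Age,Salary\n1,35,50000\n2,NA,65000\n3,30,82000\n"

# usecols limits which columns are loaded;
# na_values adds extra strings to treat as missing
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=['ID', 'Age'],
                 na_values=['NA'])
print(df)
```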
Import Excel Files
Pandas is also capable of importing data from Excel files. If your data is in an Excel file, you can read it into a pandas DataFrame using the read_excel
function. Let’s see how to do this.
import pandas as pd

df_excel = pd.read_excel('sample.xlsx')
print(df_excel.head())
Output:
   ID   Age Gender    Occupation  Salary  Experience
0   1  35.0      M      Engineer   50000        10.0
1   2  42.0      F     Scientist   65000        12.0
2   3  30.0      M     Astronaut   82000         9.0
3   4  25.0      F  Data Analyst   62000         5.0
4   5   NaN      M    Technician   48000         NaN
The read_excel function reads an Excel file into a pandas DataFrame, similar to read_csv. The data structure and values from the Excel file match the previous CSV example, complete with missing values.
Import JSON Files
Another common data format that you might encounter is JSON (JavaScript Object Notation), especially when working with data from web services and APIs.
Pandas provides a function, read_json, to easily import JSON files into a pandas DataFrame.
Let’s import a JSON file named ‘sample.json’:
import pandas as pd

df_json = pd.read_json('sample.json')
print(df_json.head())
Output (depends on your file):
   ID   Age Gender    Occupation  Salary  Experience
0   1  35.0      M      Engineer   50000        10.0
1   2  42.0      F     Scientist   65000        12.0
2   3  30.0      M     Astronaut   82000         9.0
3   4  25.0      F  Data Analyst   62000         5.0
4   5   NaN      M    Technician   48000         NaN
The read_json function reads a JSON file into a DataFrame, just as in the previous examples.
Read more on how to read JSON files using Pandas read_json.
Import SQL Files
Pandas provides the read_sql function, which allows you to execute a SQL query and get the result as a pandas DataFrame.
Let’s suppose you have a SQLite database (named ‘sample.db’) and you want to read data from a table called ’employees’.
First, you need to establish a connection with the database. Python has several libraries to connect with various databases – we’ll use sqlite3 for SQLite database:
import pandas as pd
import sqlite3

conn = sqlite3.connect('sample.db')
sql_query = "SELECT * FROM employees"
df_sql = pd.read_sql(sql_query, conn)
conn.close()

print(df_sql.head())
The output will be:
   ID   Age Gender    Occupation  Salary  Experience
0   1  35.0      M      Engineer   50000        10.0
1   2  42.0      F     Scientist   65000        12.0
2   3  30.0      M     Astronaut   82000         9.0
3   4  25.0      F  Data Analyst   62000         5.0
4   5   NaN      M    Technician   48000         NaN
Read more on how to read SQL query into DataFrame using read_sql.
Import HTML
Pandas also provides functionality to read HTML tables directly into a DataFrame using the read_html() function.
It searches for <table> elements, and within each table only for <tr>, <th>, and <td> tags.
You will also need the lxml, beautifulsoup4, and html5lib Python packages installed. You can install them via pip:
pip install lxml beautifulsoup4 html5lib
Let’s consider a simple example where we’re trying to fetch tables from a Wikipedia page.
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
dfs = pd.read_html(url)

print(f"Total tables: {len(dfs)}")
print("First table:")
print(dfs[0])
Output:
Total tables: 5
First table:
   Rank              Name  ... State-owned  Ref.
0     1           Walmart  ...         NaN   [1]
1     2  Amazon.com, Inc.  ...         NaN   [4]
...
Read more on how to read HTML tables using Pandas read_html.
Import XML Files
Suppose you have a file named shapes.xml:
<data>
  <row>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <shape>circle</shape>
    <degrees>360</degrees>
  </row>
</data>
You can read this file into a DataFrame like this:
import pandas as pd

df = pd.read_xml("shapes.xml")
print(df)
Output:
    shape  degrees  sides
0  square      360    4.0
1  circle      360    NaN
Read more about Parsing XML Files into DataFrames using Pandas read_xml.
Import Parquet Files
To read a Parquet file, use the read_parquet function from pandas. It takes the path of the Parquet file you want to read as its argument.
import pandas as pd
df = pd.read_parquet('data.parquet')
The output of the read_parquet function is a DataFrame. All the data from your Parquet file is now loaded into it, and you can start manipulating and analyzing it. To use read_parquet, you will need to have the pyarrow or fastparquet library installed:
pip install pyarrow fastparquet
Read more on how to read Parquet files using Pandas read_parquet.
Selecting and Indexing
In Pandas, there are several ways to select and index data. Let’s dive into some of these using the given sample data.
import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
Slicing Ranges
We can use Python’s list slicing syntax directly on a DataFrame to select a range of rows.
# Selecting rows 1 through 2 (end index is exclusive)
print(df[1:3])
Output:
    Name  Age    City
1   Anna   24   Paris
2  Peter   35  Berlin
Using Label (.loc)
The .loc property is used with the labels of the rows or columns. Here, the labels are the index values.
# Selecting the row with index 1
print(df.loc[1])
Output:
Name     Anna
Age        24
City    Paris
Name: 1, dtype: object
Using Integer Location (.iloc)
You can use the .iloc property with the integer location to select a row. Here, the location works like the index in a Python list, starting from 0.
# Selecting the row at position 1 (which has index 1 in this case)
print(df.iloc[1])
Output:
Name     Anna
Age        24
City    Paris
Name: 1, dtype: object
Callable function
A callable function can be used to select rows/columns. This function takes a DataFrame as input and outputs a DataFrame, Series, or index.
# Selecting rows where 'Age' is greater than 30
print(df.loc[lambda df: df['Age'] > 30])
Output:
    Name  Age    City
2  Peter   35  Berlin
3  Linda   32  London
query Method
The .query()
method allows you to filter the data using a boolean expression. This method is easier to read when dealing with complex queries.
# Selecting rows where 'Age' is greater than 30
print(df.query('Age > 30'))
Output:
    Name  Age    City
2  Peter   35  Berlin
3  Linda   32  London
where Method
The where method is used to replace values where a condition is False. This is useful when you want to maintain the shape of the DataFrame, as it returns a DataFrame of the same shape.
# Using 'where' to replace rows where Age is not greater than 30 with NaN
print(df.where(df['Age'] > 30))
Output:
    Name   Age    City
0    NaN   NaN     NaN
1    NaN   NaN     NaN
2  Peter  35.0  Berlin
3  Linda  32.0  London
Read more about Pandas where() method.
get Method
The get method returns a specific column from the DataFrame. If the column does not exist, it returns a default value (None if not specified).
# Get the 'Age' column
print(df.get('Age'))

# Try to get a non-existing column, returning a default value
print(df.get('Salary', 'Column not found'))
Output:
0    28
1    24
2    35
3    32
Name: Age, dtype: int64
Column not found
at Method
The at method is used to retrieve a single value at a particular row and column pair. The row and column labels must be specified.
# Use 'at' to get the name of the person in the second row
print(df.at[1, 'Name'])
Output:
Anna
iat Method
The iat method is similar to at, but uses integer-based positions instead of labels to access values. It is faster for accessing single values at a particular row and column.
# Use 'iat' to get the age of the person in the third row
print(df.iat[2, 1])
Output:
35
isin Method
The isin method is used to filter DataFrames. It returns a boolean mask showing whether each element is contained in the given values.
# Use 'isin' to find rows where 'City' is either 'Paris' or 'London'
print(df[df['City'].isin(['Paris', 'London'])])
Output:
    Name  Age    City
1   Anna   24   Paris
3  Linda   32  London
Read more about Pandas isin method.
Export DataFrame
Pandas not only provides functionality for importing data from various file formats, but it also allows you to export DataFrames back to these formats.
Export to CSV
The to_csv function allows you to save a DataFrame as a CSV file.
df.to_csv('sample_export.csv', index=False)
With index=False, we choose not to write the row names (the index).
Export to Excel
The to_excel function allows you to save a DataFrame as an Excel file.
df.to_excel('sample_export.xlsx', index=False)
Read more on how to export Pandas DataFrame to Excel using to_excel.
Export to SQL
To export a DataFrame to a SQL database, you can use the to_sql function.
import sqlite3

conn = sqlite3.connect('sample.db')
df.to_sql('new_table', conn, if_exists='replace', index=False)
conn.close()
In the to_sql function, the if_exists parameter determines the action to take if the table already exists. The options are 'fail', 'replace', and 'append'.
Read more on how to export Pandas DataFrame to SQL using to_sql.
Export to JSON
The to_json function allows you to save a DataFrame as a JSON file.
df.to_json('sample_export.json', orient='records')
The orient parameter specifies the format of the JSON string. The 'records' option specifies that the JSON string will be written as a list of records.
With pandas, you’re able to export your DataFrame to various file formats depending on your needs, making it easier for you to share your findings or to use your processed data in different environments.
Read more on how to convert Pandas DataFrame to JSON using to_json.
Export to HTML
The to_html function allows you to export a DataFrame to an HTML table.
df.to_html('sample.html')
This will create an HTML file named sample.html in your current directory. The file will contain a table representation of the DataFrame.
Read more on how to convert Pandas DataFrame to HTML table using to_html.
Export to XML
The to_xml function allows you to export a DataFrame to XML format.
df.to_xml('sample.xml')
This will produce an XML file from the DataFrame.
Read more on how to Convert Pandas DataFrame to XML File Using to_xml.
Export to NumPy
You can use the to_numpy() function to convert a pandas DataFrame to a NumPy array.
Let’s start by ensuring Pandas and numpy are imported:
import pandas as pd
import numpy as np
Next, let’s create a sample DataFrame:
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
To convert this DataFrame into a NumPy array, use the to_numpy() function:
numpy_array = df.to_numpy()
The output, numpy_array, is a 2D NumPy array containing the DataFrame's data.
print(numpy_array)
Output:
[['John' 28 'New York']
 ['Anna' 24 'Paris']
 ['Peter' 35 'Berlin']
 ['Linda' 32 'London']]
Export to BigQuery
You can use the to_gbq function from pandas to export your DataFrame:
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': ['A', 'B', 'C']
})

project_id = 'YOUR_PROJECT_ID'
table_id = 'YOUR_DATASET.YOUR_TABLE'
if_exists = 'replace'  # 'fail', 'replace', and 'append' are the options

df.to_gbq(table_id, project_id=project_id, if_exists=if_exists)
Read more about how to Export Pandas DataFrame to Google BigQuery using to_gbq.
Pickling & Unpickling
You can use to_pickle() and read_pickle() to pickle or unpickle DataFrames or Series. Let’s illustrate this with a simple DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20230102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})
print(df)
This will output:
     A          B    C  D      E    F
0  1.0 2023-01-02  1.0  3   test  foo
1  1.0 2023-01-02  1.0  3  train  foo
2  1.0 2023-01-02  1.0  3   test  foo
3  1.0 2023-01-02  1.0  3  train  foo
Now, let’s pickle this DataFrame:
df.to_pickle('df.pkl')
This will create a pickled file named 'df.pkl' in your working directory. To load this pickled file back into a DataFrame (unpickling), you can use:
df_new = pd.read_pickle('df.pkl')
print(df_new)
Handling Other File Formats
While CSV, Excel, SQL, and JSON are some of the most common formats you'll encounter in data science, pandas supports several others as well. Let's discuss one of them:
HDF5 Files
Hierarchical Data Format (HDF) is designed to store and organize large amounts of data. The read_hdf function can read HDF5 files, while the to_hdf function can write to them.
# Write the DataFrame to an HDF5 file, then read it back
df.to_hdf('sample.h5', key='df')
df_hdf = pd.read_hdf('sample.h5')
DataFrame Filtering
You can filter data based on specific conditions to focus on a subset of the data. Here’s how to do it:
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'Salary': [30000, 35000, 40000, 45000, 50000,
               55000, 60000, 65000, 70000, 75000]
})
df_filtered = df[df['Age'] > 40]
The df['Age'] > 40 operation returns a Series of True/False values (a boolean mask), which we then use to index into the DataFrame. This returns a new DataFrame containing only the rows where the mask is True.
print(df_filtered)
   Age  Salary
5   45   55000
6   50   60000
7   55   65000
8   60   70000
9   65   75000
As you can see, the resulting DataFrame only includes the rows where the ‘Age’ is greater than 40. You can use any condition that returns a boolean mask to filter your DataFrame in this way.
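Conditions can also be combined with & (and), | (or), and ~ (not); each individual condition must be wrapped in parentheses. A sketch using the same kind of Age/Salary data:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'Salary': [30000, 35000, 40000, 45000, 50000,
               55000, 60000, 65000, 70000, 75000]
})

# Rows where Age is over 40 AND Salary is under 70000
subset = df[(df['Age'] > 40) & (df['Salary'] < 70000)]
print(subset)
```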
Data Cleaning
Data cleaning is a crucial step in any data analysis process, and using a library like Pandas can significantly simplify this process.
Handling Missing Values
Let’s first identify missing values in our DataFrame:
print(df.isnull().sum())
Output:
ID            0
Age           1
Gender        0
Occupation    0
Salary        0
Experience    1
dtype: int64
The isnull function returns a DataFrame where each cell is True if the original cell contained a missing value and False otherwise. Summing each column gives the number of missing values per column.
There are several strategies to handle missing data:
Deleting Rows
One approach is to simply delete rows with missing values using the dropna function.
df_drop = df.dropna()
print(df_drop.isnull().sum())
Filling Missing Values
Alternatively, you could fill missing values with a summary statistic like the mean, median, or mode using the fillna function. Passing numeric_only=True restricts the mean to numeric columns, which avoids errors on text columns:
df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled.isnull().sum())
In this case, Pandas calculates the mean of the ‘Age’ and ‘Experience’ columns and uses these values to fill the missing values in the respective columns.
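fillna also accepts a dictionary mapping column names to fill values, which lets you apply a different strategy per column. A sketch with hypothetical Age/Experience data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [35.0, 42.0, np.nan, 25.0],
    'Experience': [10.0, np.nan, 9.0, 5.0]
})

# Fill Age with its median, Experience with a constant 0
df_filled = df.fillna({'Age': df['Age'].median(), 'Experience': 0})
print(df_filled)
```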
Removing Duplicates
Duplicate entries in your dataset can distort your analysis and lead to incorrect conclusions. Fortunately, Pandas provides easy-to-use functionality to detect and remove duplicate rows.
Let’s first check for duplicates in our DataFrame:
print(df.duplicated().sum())
The duplicated function returns a boolean Series that is True where a row is a duplicate.
If duplicates exist, you have to remove them for accurate results. Here’s how you can do it:
df_unique = df.drop_duplicates()
print(df_unique.duplicated().sum())
Output:
0
The drop_duplicates function returns a copy of the DataFrame with the duplicate rows removed. You can then confirm that duplicates have been removed by checking duplicated().sum() on the new DataFrame.
It's worth mentioning that drop_duplicates considers all columns by default. However, if you want to remove duplicates based on certain columns only, you can pass those column names to the function as follows:
df_unique = df.drop_duplicates(subset=['Gender', 'Occupation'])
This will remove duplicates where the combination of ‘Gender’ and ‘Occupation’ is the same.
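By default, drop_duplicates keeps the first occurrence of each duplicate group; the keep parameter lets you keep the last occurrence instead (keep='last') or drop all duplicated rows (keep=False). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['M', 'M', 'F'],
    'Occupation': ['Engineer', 'Engineer', 'Scientist'],
    'Salary': [50000, 55000, 65000]
})

# Keep the *last* row of each Gender/Occupation pair
last = df.drop_duplicates(subset=['Gender', 'Occupation'], keep='last')
print(last)
```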
String Manipulation
Pandas provides a robust set of string operations, making it easy to manipulate data stored as text. These operations can be accessed via the .str attribute of Series and Index objects.
Let's explore a few common string manipulations, assuming our DataFrame df has a column named 'Occupation'.
Convert to Lower or Upper Case
If you want to convert the 'Occupation' column to lowercase or uppercase, you can use the lower or upper method:
df['Occupation'] = df['Occupation'].str.lower()
df['Occupation'] = df['Occupation'].str.upper()
Splitting Strings
Sometimes you may need to split a string column into multiple columns. You can use the split method for this. For instance, if 'Occupation' had values like 'Data_Analyst' and you wanted to separate 'Data' and 'Analyst', you could do it like this:
df[['First','Last']] = df['Occupation'].str.split('_', expand=True)
This creates two new columns, ‘First’ and ‘Last’, containing the strings before and after the underscore, respectively.
Replacing Substrings
You can use the replace method to replace a specific substring with another one. For example, if you wanted to replace 'Data' with 'Information' in 'Occupation', you could do:
df['Occupation'] = df['Occupation'].str.replace('Data', 'Information')
Extracting Substrings
You can use the extract method to extract groups from the first match of a regular expression:
df['Extracted'] = df['Occupation'].str.extract('(Data)')
This will create a new column, ‘Extracted’, containing ‘Data’ wherever it was found in ‘Occupation’.
Remember, these are just a few examples.
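Another everyday string operation is boolean filtering with str.contains, which tests each value against a substring or regular expression. A sketch with a hypothetical 'Occupation' column:

```python
import pandas as pd

df = pd.DataFrame({
    'Occupation': ['Data Analyst', 'Engineer', 'Data Scientist']
})

# Rows whose Occupation mentions 'Data' (case-sensitive by default)
data_jobs = df[df['Occupation'].str.contains('Data')]
print(data_jobs)
```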
Changing Data Types
Pandas provides functionality to change the data types of your DataFrame, which is often required as part of data cleaning and preparation.
First, let’s check the current data types of our DataFrame:
print(df.dtypes)
This will provide an output displaying each column name with its corresponding data type.
Now, suppose we have a column ‘Age’ in our DataFrame, which is currently stored as ‘float64’ but we want it as an integer (i.e., ‘int64’). Here’s how you can do it:
df['Age'] = df['Age'].astype('int64') print(df.dtypes)
Here, we've used the astype function to change the data type.
Remember, when converting to integer types, make sure that your column doesn’t have any missing values because ‘NaN’ values are considered as floats by pandas. If there are missing values, you may want to fill or drop them before changing the data type.
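If you need an integer column that still tolerates missing values, pandas offers the nullable 'Int64' dtype (note the capital I), which stores missing entries as pd.NA instead of forcing a float conversion. A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([35.0, 42.0, np.nan, 25.0])

# Plain 'int64' would fail here because of the NaN;
# the nullable 'Int64' dtype handles it
s_int = s.astype('Int64')
print(s_int.dtype)  # Int64
print(s_int)
```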
Additionally, you can convert numeric data to categorical data types. This is especially useful in machine learning when you’re working with categorical variables.
df['Age'] = df['Age'].astype('category') print(df.dtypes)
Aggregating Data
Let’s assume that our DataFrame has columns: ‘Gender’, ‘Occupation’, and ‘Salary’. We want to calculate the average and maximum salary, and the total number of individuals for each ‘Gender’ and ‘Occupation’. Here’s how we can do it:
grouped = df.groupby(['Gender', 'Occupation'])
agg_data = grouped['Salary'].agg(['mean', 'max', 'count'])
print(agg_data)
Output:
                        mean    max  count
Gender Occupation
Female Data Analyst    59000  65000     10
       Data Scientist  61000  67000     12
Male   Data Analyst    63000  69000     11
       Data Scientist  67000  72000     15
Here, the agg function is used to perform multiple aggregations at once. It takes a list of functions as input and applies each of them to the 'Salary' column of each group.
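agg also supports named aggregation, where keyword arguments choose the output column names. A sketch with made-up salary data (the names avg_salary and top_salary are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['F', 'F', 'M', 'M'],
    'Salary': [59000, 61000, 63000, 67000]
})

# Output columns are named by the keywords
stats = df.groupby('Gender')['Salary'].agg(
    avg_salary='mean',
    top_salary='max',
)
print(stats)
```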
Grouping Data
One of the powerful functionalities provided by pandas is grouping data based on certain criteria, similar to the “GROUP BY” statement in SQL.
This allows us to analyze subsets of the data independently.
Suppose we have a column ‘Gender’ in our DataFrame, and we’d like to group our data by this variable:
grouped = df.groupby('Gender')
Once the data is grouped, we can apply aggregation functions like mean, sum, and max to these groups. For example, if you wanted to know the average salary of each gender, you could do:
mean_salary = grouped['Salary'].mean()
print(mean_salary)
Output:
Gender
Female    60000
Male      65000
dtype: int64
Here, the groupby function is used to separate the data into groups based on 'Gender'. Then, we calculate the mean salary for each group using the mean function.
We can also group by multiple columns. For example, if you wanted to group by both ‘Gender’ and ‘Occupation’, you would do:
grouped = df.groupby(['Gender', 'Occupation'])
mean_salary = grouped['Salary'].mean()
print(mean_salary)
Output:
Gender  Occupation
Female  Data Analyst      59000
        Data Scientist    61000
Male    Data Analyst      63000
        Data Scientist    67000
dtype: int64
Pivot Tables
Pivot tables are used to summarize, analyze, explore, and present summary statistics of your data. They are especially helpful when you have long data that needs to be reshaped for specific analyses.
Pandas provides a method ‘pivot_table’ to create MS Excel spreadsheet-style pivot tables.
For example, suppose we have a DataFrame df with columns 'Gender', 'Occupation', and 'Salary'. If we wanted to create a pivot table showing the average salary by gender and occupation, we could do:
pivot = df.pivot_table(values='Salary', index='Occupation',
                       columns='Gender', aggfunc='mean')
print(pivot)
Output:
Gender           Female     Male
Occupation
Data Analyst    59000.0  63000.0
Data Scientist  61000.0  67000.0
Here, the pivot_table function takes four arguments:
- values: the column you want to summarize. In this case, it's the 'Salary' column.
- index: the column(s) to set as the index of the output table. Here, it's the 'Occupation' column.
- columns: the column(s) to spread into columns of the output table. Here, it's the 'Gender' column.
- aggfunc: the aggregation function to apply. Here, we've used 'mean' to calculate the average salary.
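pivot_table also accepts fill_value, which replaces cells that have no data, and margins=True, which appends an 'All' row and column of grand totals. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['F', 'F', 'M'],
    'Occupation': ['Analyst', 'Scientist', 'Analyst'],
    'Salary': [59000, 61000, 63000]
})

# No male scientist exists, so that cell becomes 0 via fill_value
pivot = df.pivot_table(values='Salary', index='Occupation',
                       columns='Gender', aggfunc='mean',
                       fill_value=0, margins=True)
print(pivot)
```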
Correlations
Pandas provides a method, corr, to calculate correlations between columns in a DataFrame.
Let's say we have a DataFrame df that includes columns 'Age', 'Salary', and 'Experience'. To find the correlation between these columns, we can use:
correlation = df[['Age', 'Salary', 'Experience']].corr()
print(correlation)
Output:
               Age  Salary  Experience
Age         1.0000  0.7467      0.9801
Salary      0.7467  1.0000      0.7698
Experience  0.9801  0.7698      1.0000
This is a correlation matrix, where each cell in the table shows the correlation between two variables. A correlation of 1 indicates a perfect positive correlation, while -1 indicates a perfect negative correlation. A correlation of 0 means that no relationship exists.
We used the corr method, which by default calculates the Pearson correlation. Other correlation methods such as Spearman and Kendall can be specified via the method parameter.
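For example, to compute the rank-based Spearman correlation between two columns (a sketch with made-up, monotonically increasing data):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40],
    'Salary': [30000, 35000, 40000, 60000, 90000]
})

# Salary rises monotonically with Age, so Spearman is exactly 1.0
spearman = df['Age'].corr(df['Salary'], method='spearman')
print(spearman)
```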
Descriptive Statistics
Descriptive statistics summarize and organize characteristics of a dataset. Pandas provides several functions that make it easy to explore and understand the data at a high level.
Suppose we have a DataFrame with columns 'Age', 'Salary', and 'Experience'. If we want quick descriptive statistics for these columns, we can use the describe function:
desc_stats = df[['Age', 'Salary', 'Experience']].describe()
print(desc_stats)
Output:
              Age        Salary  Experience
count  100.000000    100.000000  100.000000
mean    29.710000  65000.000000    7.500000
std      6.982456   7055.473572    2.293098
min     22.000000  50000.000000    3.000000
25%     25.000000  60000.000000    6.000000
50%     28.000000  65000.000000    7.000000
75%     32.000000  70000.000000    9.000000
max     45.000000  80000.000000   13.000000
Here, count is the number of non-NA/null values, mean is the mean of the values, and std is the standard deviation. min and max are the minimum and maximum values, while 25%, 50%, and 75% are the lower quartile, median, and upper quartile respectively.
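By default, describe only covers numeric columns. Passing include='all' adds counts, unique values, and top frequencies for non-numeric columns as well. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [35, 42, 30, 25],
    'Gender': ['M', 'F', 'M', 'F']
})

desc = df.describe(include='all')
print(desc)
```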
Merging DataFrames
You can combine DataFrames in pandas using the merge function. Let's consider two sample DataFrames:
import pandas as pd

df1 = pd.DataFrame({
    'EmployeeID': ['1', '2', '3', '4'],
    'Name': ['Amy', 'Bob', 'Charlie', 'Dave'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing']
})
df2 = pd.DataFrame({
    'EmployeeID': ['3', '4', '5', '6'],
    'Age': [25, 30, 35, 40],
    'Salary': [70000, 80000, 90000, 100000]
})
If you want to merge these DataFrames based on the ‘EmployeeID’, you can do it like this:
df = pd.merge(df1, df2, on='EmployeeID')
print(df)
Output:
  EmployeeID     Name Department  Age  Salary
0          3  Charlie         IT   25   70000
1          4     Dave  Marketing   30   80000
Here, Pandas matched up the ‘EmployeeID’ values across the two DataFrames and joined them together. By default, merge
performs an inner join, which means that only the keys that are present in both DataFrames are combined together.
If you want to include all the keys from both DataFrames, you can specify how='outer':
df = pd.merge(df1, df2, on='EmployeeID', how='outer')
print(df)
Output:
  EmployeeID     Name Department   Age    Salary
0          1      Amy         HR   NaN       NaN
1          2      Bob      Sales   NaN       NaN
2          3  Charlie         IT  25.0   70000.0
3          4     Dave  Marketing  30.0   80000.0
4          5      NaN        NaN  35.0   90000.0
5          6      NaN        NaN  40.0  100000.0
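Between inner and outer joins sit how='left' and how='right', which keep every row from one side only. A sketch reusing a trimmed version of the same hypothetical employee data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'EmployeeID': ['1', '2', '3', '4'],
    'Name': ['Amy', 'Bob', 'Charlie', 'Dave']
})
df2 = pd.DataFrame({
    'EmployeeID': ['3', '4', '5', '6'],
    'Salary': [70000, 80000, 90000, 100000]
})

# Keep all rows of df1; Salary is NaN where df2 has no match
left = pd.merge(df1, df2, on='EmployeeID', how='left')
print(left)
```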
Concatenating DataFrames
Concatenation is another way to combine DataFrames in pandas. This is typically used when you want to append rows or columns from one DataFrame to another. Let’s consider two sample DataFrames:
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7])
If you want to concatenate these DataFrames along the row axis (i.e., append df2 as new rows to df1), you can do:
df = pd.concat([df1, df2])
print(df)
Output:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
By default, concat joins along axis=0 (i.e., it appends rows). If you want to concatenate along the column axis (i.e., append df2 as new columns to df1), you can specify axis=1:
df = pd.concat([df1, df2], axis=1)
Concatenating DataFrames is often useful in data wrangling where you have data in multiple pieces and you want to combine them into a single DataFrame for further analysis.
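If the pieces being concatenated carry overlapping or meaningless indexes, pass ignore_index=True so that concat builds a fresh 0..n-1 index. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']})
df2 = pd.DataFrame({'A': ['A2', 'A3']})  # also indexed 0, 1

combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```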
Reshaping DataFrames
You can use the melt and pivot functions to reshape your DataFrames.
Melt
You can use the melt function to reshape data into a format where one or more columns are identifier variables, while all other columns are treated as measured variables.
Let’s consider a sample DataFrame:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-06-01', '2023-06-01', '2023-06-02', '2023-06-02'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Temperature': [75, 77, 76, 78],
    'Humidity': [80, 70, 78, 85]
})
print(df)
Output:
         Date         City  Temperature  Humidity
0  2023-06-01     New York           75        80
1  2023-06-01  Los Angeles           77        70
2  2023-06-02     New York           76        78
3  2023-06-02  Los Angeles           78        85
If you want to melt this DataFrame, you can do:
melted = df.melt(id_vars=['Date', 'City'],
                 var_name='Measurement', value_name='Value')
print(melted)
Output:
         Date         City  Measurement  Value
0  2023-06-01     New York  Temperature     75
1  2023-06-01  Los Angeles  Temperature     77
2  2023-06-02     New York  Temperature     76
3  2023-06-02  Los Angeles  Temperature     78
4  2023-06-01     New York     Humidity     80
5  2023-06-01  Los Angeles     Humidity     70
6  2023-06-02     New York     Humidity     78
7  2023-06-02  Los Angeles     Humidity     85
Pivot
The pivot function is the inverse of melt. It reshapes data by turning unique values from one column into separate columns of the DataFrame, filling the corresponding cells with the appropriate values.
Continuing with the melted DataFrame from the previous example, you can pivot it back into its original wide form like this:
pivoted = melted.pivot(index=['Date', 'City'],
                       columns='Measurement', values='Value')
print(pivoted.reset_index())
Output:
Measurement        Date         City  Humidity  Temperature
0            2023-06-01  Los Angeles        70           77
1            2023-06-01     New York        80           75
2            2023-06-02  Los Angeles        85           78
3            2023-06-02     New York        78           76
Column Operations
You can perform several operations on columns such as selecting, renaming, creating, and deleting columns. Let’s consider a sample DataFrame:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amy', 'Bob', 'Charlie', 'Dave'],
    'Age': [23, 25, 22, 21],
    'Salary': [70000, 80000, 90000, 85000]
})
Selecting Columns
You can select a single column by using bracket notation and the column’s name:
names = df['Name']
print(names)
Output:
0        Amy
1        Bob
2    Charlie
3       Dave
Name: Name, dtype: object
Renaming Columns
To rename a column, you can use the rename function:
df = df.rename(columns={'Name': 'Employee Name'})
print(df)
Creating Columns
To create a new column, you can define it in terms of existing columns:
df['Age Next Year'] = df['Age'] + 1
print(df)
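If you prefer not to modify the DataFrame in place, the `assign` function builds the new columns on a copy and returns it, which works nicely in method chains. A short sketch (the column names here are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amy', 'Bob'],
    'Age': [23, 25],
    'Salary': [70000, 80000]
})

# assign() returns a new DataFrame with the extra columns,
# leaving the original df untouched
df2 = df.assign(Monthly=df['Salary'] / 12, AgeNextYear=df['Age'] + 1)
print(df2)
```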
Deleting Columns
To delete a column, you can use the `drop` function:
df = df.drop(columns='Age Next Year')
print(df)
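`drop` also accepts a list of names, so several columns can be removed in one call. Note that it returns a new DataFrame and leaves the original untouched unless you reassign the result (or pass `inplace=True`):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amy', 'Bob'],
    'Age': [23, 25],
    'Salary': [70000, 80000]
})

# Dropping two columns at once; df itself is unchanged
trimmed = df.drop(columns=['Age', 'Salary'])
print(trimmed)
```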
Date and Time Handling
To start, let’s create a sample DataFrame with date strings:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amy', 'Bob', 'Charlie', 'Dave'],
    'DOB': ['1995-05-01', '1996-07-23', '1997-08-15', '1998-01-10']
})
Converting Strings to Datetime
Often, dates come as strings, and it would be beneficial to convert these into datetime objects to perform date-related operations:
df['DOB'] = pd.to_datetime(df['DOB'])
print(df['DOB'].dtype)  # Output: datetime64[ns]
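If some strings may not be valid dates, `to_datetime` raises an error by default. Passing `errors='coerce'` converts unparseable values to `NaT` instead, so you can clean them up like any other missing data. A small sketch with an intentionally bad value:

```python
import pandas as pd

# 'not a date' cannot be parsed; errors='coerce' turns it into NaT
# instead of raising an exception
raw = pd.Series(['1995-05-01', 'not a date', '1998-01-10'])
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)
```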
Extracting Date Components
Once we have a datetime column, we can extract various components like the year, month, day, etc.:
df['Year'] = df['DOB'].dt.year
print(df)
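The `.dt` accessor exposes many other components besides the year, such as the month, the day, and the weekday name. For example:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Amy', 'Bob'],
    'DOB': pd.to_datetime(['1995-05-01', '1996-07-23'])
})

# Pull out individual date components via the .dt accessor
df['Month'] = df['DOB'].dt.month
df['Day'] = df['DOB'].dt.day
df['Weekday'] = df['DOB'].dt.day_name()  # e.g. 'Monday'
print(df)
```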
Time Delta
Time delta represents the duration or the difference in time. Let’s assume you have data with timestamps and you want to calculate the time difference between events:
import pandas as pd

time_data = pd.Series(pd.date_range('2023-01-01', periods=3, freq='D'))
print(time_data)
Output:
0   2023-01-01
1   2023-01-02
2   2023-01-03
dtype: datetime64[ns]
To get the time difference (Timedelta) between these dates, you can use the `diff()` function:
time_delta = time_data.diff()
print(time_delta)
Output:
0      NaT
1   1 days
2   1 days
dtype: timedelta64[ns]
Here, the `diff()` function calculates the difference between each element and the one before it in the Series. Since the first timestamp has no prior element, its result is `NaT` (Not a Time), the pandas representation for missing or null time data.
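A Timedelta can also come from subtracting two datetime values directly, and its `.dt.days` attribute gives the difference as plain integers. As a rough sketch, here is one way to estimate ages from dates of birth (the reference date is fixed so the example stays reproducible, and the division by 365 ignores leap-day precision):

```python
import pandas as pd

dob = pd.to_datetime(pd.Series(['1995-05-01', '1998-01-10']))

# Subtracting datetimes yields a timedelta64 Series; a fixed reference
# date (an assumption for this example) keeps the result reproducible
today = pd.Timestamp('2023-06-01')
age_days = today - dob
df_age = pd.DataFrame({'DOB': dob, 'AgeInYears': age_days.dt.days // 365})
print(df_age)
```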
Changing Date Format
You can also change the format of your dates using the `strftime` function:
df['DOB'] = df['DOB'].dt.strftime('%d/%m/%Y')
print(df)
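Be aware that `strftime` returns plain strings (the column's dtype becomes `object`), so datetime operations like `.dt.year` no longer work afterwards; it's usually best to reformat dates as the very last step:

```python
import pandas as pd

dob = pd.to_datetime(pd.Series(['1995-05-01', '1996-07-23']))

formatted = dob.dt.strftime('%d/%m/%Y')
# The result is a Series of plain strings, not datetimes, so the .dt
# datetime accessors are no longer available on it
print(formatted)
print(formatted.dtype)  # object
```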
Data Visualization
Pandas provides data visualization through a family of plotting functions, so you can chart your data in many ways depending on your needs.
Line Plot
Line charts are among the most frequently used chart types in data visualization. They are especially useful for visualizing changes in values over time.
You can use the `DataFrame.plot.line()` function to generate line plots in pandas. Let’s consider a DataFrame with monthly sales data:
import pandas as pd

df = pd.DataFrame({
    'Month': pd.date_range(start='01-01-2023', periods=12, freq='M'),
    'Sales': [15, 18, 20, 22, 25, 23, 30, 33, 40, 45, 50, 55]
})
df.set_index('Month', inplace=True)
Now, you can plot the sales data:
df.plot.line(title='Monthly Sales')
This will produce a line chart with the months on the x-axis and the sales values on the y-axis.
Remember, if you’re using a Jupyter notebook, you don’t need to do anything special to view the plot. In other environments, however, you must call Matplotlib’s `plt.show()` to display it; otherwise the plot won’t be visible:
import matplotlib.pyplot as plt
df.plot.line(title='Monthly Sales')
plt.show()
Bar Plot
You can use the `DataFrame.plot.bar()` function to create bar charts. Let’s consider a DataFrame with sales data across different categories:
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Home & Kitchen', 'Health & Beauty'],
    'Sales': [15000, 12000, 18000, 20000]
})
df.set_index('Category', inplace=True)
Now, to generate a bar chart:
df.plot.bar(title='Sales by Category', color='blue', alpha=0.75)
In the resulting bar chart, each category (Electronics, Clothing, Home & Kitchen, Health & Beauty) has a bar associated with it, and the length of the bar corresponds to the amount of sales in that category.
We used the `color` argument to set the color of the bars and the `alpha` argument to set their transparency. The title of the chart is set using the `title` argument.
Scatter Plot
Scatter plots are used to display the relationship between two numerical variables. With pandas, you can use the `DataFrame.plot.scatter()` function to create scatter plots. Let’s consider a DataFrame with age and salary data:
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
})
To generate a scatter plot to examine the relationship between age and salary:
df.plot.scatter(x='Age', y='Salary', title='Age vs Salary')
In the scatter plot, each point represents a single individual’s age (x-axis) and salary (y-axis). If there’s a clear pattern in the distribution of points, it suggests a relationship between the two variables.
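To quantify the relationship you see in a scatter plot, you can compute the Pearson correlation coefficient with `corr`; values near 1 or -1 indicate a strong linear relationship:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
})

# Pearson correlation between the two plotted variables; this sample
# data is almost perfectly linear, so the coefficient is close to 1
r = df['Age'].corr(df['Salary'])
print(r)
```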
Histogram
A histogram is a graphical representation that organizes a group of data points into a specified range.
In Pandas, you can use the `DataFrame.plot.hist()` function to create histograms. Let’s consider a DataFrame with age data:
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 25, 22, 35, 40, 35, 22, 40, 35, 22, 30, 25, 30, 25, 40, 30, 35, 40]
})
To generate a histogram to visualize the distribution of ages:
df['Age'].plot.hist(bins=6, alpha=0.5, title='Age Distribution')
In this histogram, the ages (x-axis) are grouped into bins (ranges of ages), and the number of occurrences in each range is shown on the y-axis. The `bins` argument sets the number of bins in the histogram, and the `alpha` argument sets the transparency of the bars.
In this case, the histogram will show the ages spread fairly evenly between 22 and 40, with a dip in the early 30s.
Box Plot
You can use the `DataFrame.plot.box()` function to create box plots. Let’s consider a DataFrame with salary data:
import pandas as pd

df = pd.DataFrame({
    'Salary': [35000, 40000, 36000, 41000, 39000, 38000, 37000, 42000, 43000, 45000, 44000, 50000, 55000]
})
Now, to generate a box plot of the salary data:
df['Salary'].plot.box(title='Salary Distribution')
In this box plot, the box represents the interquartile range (IQR), i.e., 25th to 75th percentile of the data, the line inside the box is the median (50th percentile), and the whiskers typically represent the range for the rest of the data.
Any points outside of the whiskers can be considered outliers.
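The numbers a box plot visualizes can also be computed directly with `quantile`, which is handy for reporting or for flagging outliers programmatically:

```python
import pandas as pd

salaries = pd.Series([35000, 40000, 36000, 41000, 39000, 38000, 37000,
                      42000, 43000, 45000, 44000, 50000, 55000])

# The same statistics the box plot draws: quartiles and the IQR
q1 = salaries.quantile(0.25)
median = salaries.quantile(0.50)
q3 = salaries.quantile(0.75)
iqr = q3 - q1
print(q1, median, q3, iqr)
```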
Pandas comes to the rescue
I was approached by a large US retail company for a rather complex data analysis project.
They were dealing with sales data spread across multiple Excel and CSV files, each containing hundreds of thousands of rows and more than 40 columns of data.
The data files were located on their cloud storage and were updated daily. My task was to create a system that could efficiently handle this data, perform numerous transformations, and churn out insightful reports daily.
At first, I took a stab at the problem the old-fashioned way: I wrote some Python code that utilized the built-in `csv` library to read and parse the data.
I quickly ran into issues: the sheer amount of data was too large for efficient processing in regular Python.
Even simple tasks, like calculating averages or filtering data based on conditions, became painfully slow.
It was taking roughly 2 hours to process a single file, meaning a full day’s worth of files could potentially take over 48 hours! I quickly realized that this approach was unworkable.
At this moment, I turned to Pandas. A mighty tool indeed. Using it, I could read files directly from the cloud storage, perform necessary transformations on the entire dataset at once, and handle missing or wrong data, all while enjoying the benefits of its highly optimized performance.
With Pandas, reading in a single file took a mere 2 minutes, down from the initial 2 hours – that’s a 60x speed improvement!
The average time required to process a file, including all the transformations and calculations, was around 15 minutes.
That meant I could process an entire day’s worth of files in less than 12 hours, rather than the originally projected 48 hours – slashing the total time by 75%.
Furthermore, the code was far cleaner and more maintainable using Pandas. Because it allowed operations on entire data structures at once, the number of lines in the code was reduced by approximately 65%.
Another significant advantage of using Pandas was its robust handling of missing or incorrect data. Plain-Python approaches like the `csv` module would’ve required complex and time-consuming error handling for each individual operation. With Pandas, this was handled almost automatically, saving an estimated additional 20% in overall development time.
My time spent on the project decreased by nearly 80%, and the client was delighted with the efficiency and accuracy of the daily reports.
Ultimately, it was a stark reminder: in the realm of data science, the right tools can make all the difference.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.