Python Pandas Tutorial (Beginner's Guide to Data Analysis)

Pandas is an open-source, high-performance library used for tasks such as data analysis, machine learning, and feature engineering.

One of the critical tools in a data scientist’s arsenal is the Pandas library, a powerful Python package for data analysis and manipulation.

 


Python Pandas Installation

Open your command line (terminal for Mac/Linux, command prompt for Windows) and type the following command:

pip install pandas

If you are using a Python environment like Anaconda, you can use the conda package manager to install pandas:

conda install pandas

After you’ve installed it, you can import pandas in your Python script like so:

import pandas as pd

It’s customary to import pandas under the alias pd for brevity.

 

Pandas Data Structures

Pandas provides two primary data structures – the DataFrame and the Series. Both are built on top of NumPy, another powerful Python package, which makes data manipulation in pandas flexible and efficient.
Let’s take a quick look at these data structures:

DataFrames

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used pandas object.

import pandas as pd
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London

The column of numbers on the left (0 to 3) is the index of the DataFrame, and it was generated automatically.

 

Series

A Series is a one-dimensional labeled array that can hold any data type. It’s essentially a single column of a DataFrame, and the axis labels are collectively referred to as the index.

Here is how you can create a pandas Series from Python’s built-in data types – a list is shown first, with a dictionary and a tuple sketched right after:

import pandas as pd
import numpy as np
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

Output:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

We just created a pandas Series from a list, which can contain any data type. Note that np.nan was used to represent a missing value.
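
The same constructor also accepts other built-in types. Here is a quick sketch (with made-up values) showing a Series built from a dictionary, whose keys become the index labels, and from a tuple, which behaves like a list:

import pandas as pd

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
print(s_dict)

# From a tuple: works like a list, with a default integer index
s_tuple = pd.Series((1, 2, 3))
print(s_tuple)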

 

Import CSV Files

Assuming we have a CSV file named ‘sample.csv’ in the same directory as our Python script, we can import this data as follows:

import pandas as pd
df = pd.read_csv('sample.csv')
print(df.head())

Assume the output as follows:

   ID  Age Gender   Occupation  Salary  Experience
0   1   35      M     Engineer   50000        10.0
1   2   42      F     Scientist  65000        12.0
2   3   30      M    Astronaut   82000         9.0
3   4   25      F  Data Analyst  62000         5.0
4   5   NaN      M   Technician  48000         NaN

The read_csv function reads a CSV file and converts it into a DataFrame. The head() function retrieves the first five rows of the DataFrame, which is useful for quickly checking that your object contains the right kind of data.

The sample data includes columns for ID, Age, Gender, Occupation, Salary, and Experience.

Note that there are some missing values in the Age and Experience columns, as well as some duplicates, which we will handle in the upcoming sections.
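
read_csv also accepts many optional parameters. Below is a small sketch of a few commonly useful ones, assuming the same sample.csv file:

import pandas as pd

df_subset = pd.read_csv(
    'sample.csv',
    usecols=['ID', 'Age', 'Salary'],  # load only these columns
    nrows=100                         # read at most the first 100 rows
)
print(df_subset.head())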

 

Import Excel Files

Pandas is also capable of importing data from Excel files. If your data is in an Excel file, you can read it into a pandas DataFrame using the read_excel function. Let’s see how to do this.

import pandas as pd
df_excel = pd.read_excel('sample.xlsx')
print(df_excel.head())

Output:

   ID  Age Gender   Occupation  Salary  Experience
0   1   35      M     Engineer   50000        10.0
1   2   42      F     Scientist  65000        12.0
2   3   30      M    Astronaut   82000         9.0
3   4   25      F  Data Analyst  62000         5.0
4   5   NaN      M   Technician  48000         NaN

The read_excel function reads an Excel file into a pandas DataFrame, similar to read_csv. Note that reading .xlsx files requires the openpyxl package (pip install openpyxl).

The data structure and values from the Excel file are similar to the previous CSV example, complete with missing values.
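
If your workbook contains several sheets, you can pick one with the sheet_name parameter. A minimal sketch, assuming a sheet named 'Sheet1' exists in sample.xlsx:

import pandas as pd

df_sheet = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
print(df_sheet.head())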

 

Import JSON Files

Another common data format that you might encounter is JSON (JavaScript Object Notation), especially when working with data from web services and APIs.

Pandas provides a function, read_json, to easily import JSON files into a pandas DataFrame.
Let’s import a JSON file named ‘sample.json’:

import pandas as pd
df_json = pd.read_json('sample.json')
print(df_json.head())

Output (depends on your file):

   ID  Age Gender   Occupation  Salary  Experience
0   1   35      M     Engineer   50000        10.0
1   2   42      F     Scientist  65000        12.0
2   3   30      M    Astronaut   82000         9.0
3   4   25      F  Data Analyst  62000         5.0
4   5   NaN      M   Technician  48000         NaN

The read_json function reads a JSON file into a DataFrame, just as in the previous examples.

Read more on how to read JSON files using Pandas read_json.

 

Import SQL Files

Pandas provides the read_sql function, which allows you to execute a SQL query and get the result as a Pandas DataFrame.
Let’s suppose you have a SQLite database (named ‘sample.db’) and you want to read data from a table called ‘employees’.
First, you need to establish a connection with the database. Python has several libraries to connect with various databases – we’ll use sqlite3 for SQLite database:

import pandas as pd
import sqlite3

conn = sqlite3.connect('sample.db')
sql_query = "SELECT * FROM employees"
df_sql = pd.read_sql(sql_query, conn)
conn.close()
print(df_sql.head())

The output will be:

   ID  Age Gender   Occupation  Salary  Experience
0   1   35      M     Engineer   50000        10.0
1   2   42      F     Scientist  65000        12.0
2   3   30      M    Astronaut   82000         9.0
3   4   25      F  Data Analyst  62000         5.0
4   5   NaN      M   Technician  48000         NaN

Read more on how to read SQL query into DataFrame using read_sql.

 

Import HTML

Pandas also provides functionality to read HTML tables directly into a DataFrame using the read_html() function.

It searches for table elements and only for tr, th, and td tags within each table.

You will also need to have the lxml, beautifulsoup4, and html5lib Python packages installed. You can install them via pip:

pip install lxml beautifulsoup4 html5lib

Let’s consider a simple example where we’re trying to fetch tables from a Wikipedia page.

import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
dfs = pd.read_html(url)
print(f"Total tables: {len(dfs)}")
print("First table:")
print(dfs[0])

Output:

Total tables: 5
First table:
   Rank                                     Name  ... State-owned  Ref.
   Rank                                     Name  ... State-owned  Ref.
0     1                                  Walmart  ...         NaN   [1]
1     2                         Amazon.com, Inc.  ...         NaN   [4]
...
...

Read more on how to read HTML tables using Pandas read_html.

 

Import XML Files

Suppose you have a file named shapes.xml:

<data>
    <row>
        <shape>square</shape>
        <degrees>360</degrees>
        <sides>4.0</sides>
    </row>
    <row>
        <shape>circle</shape>
        <degrees>360</degrees>
    </row>
</data>

You can read this file into a DataFrame like this:

import pandas as pd
df = pd.read_xml("shapes.xml")
print(df)

Output:

    shape  degrees  sides
0  square      360    4.0
1  circle      360    NaN

Read more about Parsing XML Files into DataFrames using Pandas read_xml.

 

Import Parquet Files

To read a Parquet file, you can use the read_parquet function from pandas. This function takes the path to the Parquet file you want to read as its argument.

import pandas as pd
df = pd.read_parquet('data.parquet')

The output of the read_parquet function is a DataFrame. All the data from your parquet file is now loaded into this DataFrame, and you can start manipulating and analyzing it.

To use the read_parquet function, you will need to have the pyarrow or fastparquet library installed.

pip install pyarrow fastparquet

Read more on how to read Parquet files using Pandas read_parquet.

 

Selecting and Indexing

In Pandas, there are several ways to select and index data. Let’s dive into some of these using the given sample data.

import pandas as pd
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)

Slicing Ranges

We can use Python’s list slicing syntax directly on a DataFrame to select a range of rows.

# Selecting rows 1 through 2 (end index is exclusive)
print(df[1:3])

Output:

    Name  Age    City
1   Anna   24   Paris
2  Peter   35  Berlin

Using Label (.loc)

The .loc property is used with the label of the rows or columns. Here, labels are the index values.

# Selecting the row with index 1
print(df.loc[1])

Output:

Name     Anna
Age        24
City    Paris
Name: 1, dtype: object

Using Integer Location (.iloc)

You can use the .iloc property with the integer location to select a row. Here, the location is like the index in a Python list, starting from 0.

# Selecting the row at position 1 (which has index 1 in this case)
print(df.iloc[1])

Output:

Name     Anna
Age        24
City    Paris
Name: 1, dtype: object

 

Callable function

A callable can also be used to select rows or columns. The callable receives the DataFrame and must return a valid indexer, such as a boolean Series or a list of labels.

# Selecting rows where 'Age' is greater than 30
print(df.loc[lambda df: df['Age'] > 30])

Output:

    Name  Age    City
2  Peter   35  Berlin
3  Linda   32  London

query Method

The .query() method allows you to filter the data using a boolean expression. This method is easier to read when dealing with complex queries.

# Selecting rows where 'Age' is greater than 30
print(df.query('Age > 30'))

Output:

    Name  Age    City
2  Peter   35  Berlin
3  Linda   32  London

where Method

The where method is used to replace values where a condition is false. This is useful when you want to maintain the shape of the DataFrame, as it returns a DataFrame of the same shape.

# Using 'where' to replace ages less than 30 with NaN
print(df.where(df['Age'] > 30))

Output:

    Name   Age    City
0    NaN   NaN     NaN
1    NaN   NaN     NaN
2  Peter  35.0  Berlin
3  Linda  32.0  London

Read more about Pandas where() method.

get Method

The get method returns a specific column from the DataFrame. If the column does not exist, it will return a default value (None if not specified).

# Get the 'Age' column
print(df.get('Age'))

# Try to get a non-existing column, return default 'Column not found'
print(df.get('Salary', 'Column not found'))

Output:

0    28
1    24
2    35
3    32
Name: Age, dtype: int64
Column not found

at Method

The at method is used to retrieve a single value at a particular row and column pair that are passed as arguments. The row and column labels must be specified.

# Use 'at' to get the name of the person in the second row
print(df.at[1, 'Name'])

Output:

Anna

iat Method

The iat method is similar to at, but instead of labels, it uses integer-based position to access values. It is faster for accessing single values at a particular column and row.

# Use 'iat' to get the age of the person in the third row
print(df.iat[2, 1])

Output:

35

isin Method

The isin method is used to filter data. It returns a boolean mask (a Series here, since we call it on a single column) showing whether each element is contained in the given values.

# Use 'isin' to find rows where 'City' is either 'Paris' or 'London'
print(df[df['City'].isin(['Paris', 'London'])])

Output:

    Name  Age    City
1   Anna   24   Paris
3  Linda   32  London

Read more about Pandas isin method.

 

Export DataFrame

Pandas not only provides functionality for importing data from various file formats, but it also allows you to export DataFrames back to these formats.

Export to CSV

The to_csv function allows you to save a DataFrame as a CSV file.

df.to_csv('sample_export.csv', index=False)

With index=False, we choose not to write row names (index).

Export to Excel

The to_excel function allows you to save a DataFrame as an Excel file.

df.to_excel('sample_export.xlsx', index=False)

Read more on how to export Pandas DataFrame to Excel using to_excel.

Export to SQL

To export a DataFrame to a SQL database, you can use the to_sql function.

import sqlite3
conn = sqlite3.connect('sample.db')
df.to_sql('new_table', conn, if_exists='replace', index=False)
conn.close()

In the to_sql function, the if_exists parameter determines the action to take if the table already exists. The options are ‘fail’, ‘replace’, and ‘append’.

Read more on how to export Pandas DataFrame to SQL using to_sql.

Export to JSON

The to_json function allows you to save a DataFrame as a JSON file.

df.to_json('sample_export.json', orient='records')

The orient parameter specifies the format of the JSON string. The ‘records’ option specifies that the JSON string will be written as a list of records.
With pandas, you’re able to export your DataFrame to various file formats depending on your needs, making it easier to share your findings or to use your processed data in different environments.
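
As a quick illustration of what orient='records' produces, here is a sketch with a small made-up DataFrame:

import pandas as pd

df_small = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
print(df_small.to_json(orient='records'))
# prints: [{"a":1,"b":"x"},{"a":2,"b":"y"}]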

Read more on how to convert Pandas DataFrame to JSON using to_json.

Export to HTML

The to_html function allows you to export a DataFrame to an HTML table.

df.to_html('sample.html')

This will create an HTML file named sample.html in your current directory. The file will contain a table representation of the DataFrame.

Read more on how to convert Pandas DataFrame to HTML table using to_html.

Export to XML

The to_xml function allows you to export a DataFrame to XML format.

df.to_xml('sample.xml')

This will produce an XML file from the DataFrame.

Read more on how to Convert Pandas DataFrame to XML File Using to_xml.

 

Export to NumPy

You can use the to_numpy() function to convert a pandas DataFrame into a NumPy array.

Let’s start by ensuring Pandas and numpy are imported:

import pandas as pd
import numpy as np

Next, let’s create a sample DataFrame:

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)

To convert this DataFrame into a NumPy array, use the to_numpy() function:

numpy_array = df.to_numpy()

The output, numpy_array, is a 2D NumPy array with the DataFrame’s data.

print(numpy_array)

Output:

[['John' 28 'New York']
 ['Anna' 24 'Paris']
 ['Peter' 35 'Berlin']
 ['Linda' 32 'London']]

 

Export to BigQuery

You can use the to_gbq function from pandas to export your DataFrame (this requires the pandas-gbq package to be installed):

import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': ['A', 'B', 'C']
})
project_id = 'YOUR_PROJECT_ID'
table_id = 'YOUR_DATASET.YOUR_TABLE'
if_exists = 'replace'  # 'fail', 'replace', 'append' are the options
df.to_gbq(table_id, project_id=project_id, if_exists=if_exists)

Read more about how to Export Pandas DataFrame to Google BigQuery using to_gbq.

 

Pickling & Unpickling

You can use to_pickle() and read_pickle() to pickle or unpickle DataFrames or Series. Let’s illustrate this with a simple DataFrame:

import pandas as pd
import numpy as np
df = pd.DataFrame({
   'A': 1.,
   'B': pd.Timestamp('20230102'),
   'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   'D': np.array([3] * 4, dtype='int32'),
   'E': pd.Categorical(["test", "train", "test", "train"]),
   'F': 'foo'
})
print(df)

This will output:

     A          B    C  D      E    F
0  1.0 2023-01-02  1.0  3   test  foo
1  1.0 2023-01-02  1.0  3  train  foo
2  1.0 2023-01-02  1.0  3   test  foo
3  1.0 2023-01-02  1.0  3  train  foo

Now, let’s pickle this DataFrame:

df.to_pickle('df.pkl')

This will create a pickled file named ‘df.pkl’ in your working directory. To load this pickled file back into a DataFrame (unpickling), you can use:

df_new = pd.read_pickle('df.pkl') 
print(df_new)

 

Handling Other File Formats

CSV, Excel, SQL, and JSON are some of the most common formats you’ll encounter in data science, but pandas can handle several other formats as well. Let’s look at one of them:

HDF5 Files

Hierarchical Data Format (HDF) is designed to store and organize large amounts of data. The read_hdf function can read HDF5 files, while the to_hdf function can write to them (both require the PyTables package, installable with pip install tables).

df.to_hdf('sample.h5', key='df')
df_hdf = pd.read_hdf('sample.h5', key='df')

 

DataFrame Filtering

You can filter data based on specific conditions to focus on a subset of the data. Here’s how to do it:

import pandas as pd
df = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
})
df_filtered = df[df['Age'] > 40]

The df['Age'] > 40 operation returns a Series of True/False values (a boolean mask), which we then use to index into the DataFrame. This operation returns a new DataFrame containing only the rows where the mask values are True.

print(df_filtered)

Output:

   Age  Salary
5   45   55000
6   50   60000
7   55   65000
8   60   70000
9   65   75000

As you can see, the resulting DataFrame only includes the rows where the ‘Age’ is greater than 40. You can use any condition that returns a boolean mask to filter your DataFrame in this way.

 

Data Cleaning

Data cleaning is a crucial step in any data analysis process, and using a library like Pandas can significantly simplify this process.

Handling Missing Values

Let’s first identify missing values in our DataFrame:

print(df.isnull().sum())

Output:

ID            0
Age           1
Gender        0
Occupation    0
Salary        0
Experience    1
dtype: int64

The isnull function returns a DataFrame where each cell is either True (if the original cell contained a missing value) or False. When we sum each column, we get the number of missing values in each column.
There are several strategies to handle missing data:

Deleting Rows

One approach is to simply delete rows with missing values using the dropna function.

df_drop = df.dropna()
print(df_drop.isnull().sum())

Filling Missing Values

Alternatively, you could fill missing values with a summary statistic such as the mean, median, or mode using the fillna function.

df_filled = df.fillna(df.mean(numeric_only=True))
print(df_filled.isnull().sum())

In this case, pandas calculates the mean of the numeric columns (‘Age’ and ‘Experience’) and uses those values to fill the missing entries in the respective columns; numeric_only=True ensures non-numeric columns are skipped.

 

Removing Duplicates

Duplicate entries in your dataset can distort your analysis and lead to incorrect conclusions. Fortunately, Pandas provides easy-to-use functionality to detect and remove duplicate rows.
Let’s first check for duplicates in our DataFrame:

print(df.duplicated().sum())

The duplicated function returns a Boolean series that is True if a row is a duplicate.
If duplicates exist, you have to remove them for accurate results. Here’s how you can do it:

df_unique = df.drop_duplicates()
print(df_unique.duplicated().sum())

Output:

0

The drop_duplicates function returns a copy of the DataFrame with the duplicate rows removed.

You can then confirm that duplicates have been removed by checking duplicated().sum() on the new DataFrame.
It’s worth mentioning that drop_duplicates by default considers all columns.

However, if you want to specifically remove duplicates based on certain columns, you can do so by passing those column names in the function as follows:

df_unique = df.drop_duplicates(subset=['Gender', 'Occupation'])

This will remove duplicates where the combination of ‘Gender’ and ‘Occupation’ is the same.

 

String Manipulation

Pandas provides a robust set of string operations, making it easy to manipulate data stored as text. These operations can be accessed via the .str attribute of Series and Indexes.

Let’s explore a few common string manipulations. Let’s assume our DataFrame df has a column named ‘Occupation’.

Convert to Lower or Upper Case

If you want to convert the ‘Occupation’ column to lowercase or uppercase, you can use the lower or upper method:

df['Occupation'] = df['Occupation'].str.lower()  # convert to lowercase
df['Occupation'] = df['Occupation'].str.upper()  # or convert to uppercase

Splitting Strings

Sometimes you may need to split a string column into multiple columns. You can use the split method for this. For instance, if ‘Occupation’ had values like ‘Data_Analyst’, and you wanted to separate ‘Data’ and ‘Analyst’, you can do it like this:

df[['First','Last']] = df['Occupation'].str.split('_', expand=True)

This creates two new columns, ‘First’ and ‘Last’, containing the strings before and after the underscore, respectively.

Replacing Substrings

You can use the replace method to replace a specific substring with another one. For example, if you wanted to replace ‘Data’ with ‘Information’ in ‘Occupation’, you could do:

df['Occupation'] = df['Occupation'].str.replace('Data', 'Information')

Extracting Substrings

You can use the extract method to extract groups from the first match of a regular expression:

df['Extracted'] = df['Occupation'].str.extract('(Data)')

This will create a new column, ‘Extracted’, containing ‘Data’ wherever it was found in ‘Occupation’.
Remember, these are just a few examples.

 

Changing Data Types

Pandas provides functionality to change the data types of your DataFrame, which is often required as part of data cleaning and preparation.
First, let’s check the current data types of our DataFrame:

print(df.dtypes)

This will provide an output displaying each column name with its corresponding data type.
Now, suppose we have a column ‘Age’ in our DataFrame, which is currently stored as ‘float64’ but we want it as an integer (i.e., ‘int64’). Here’s how you can do it:

df['Age'] = df['Age'].astype('int64')
print(df.dtypes)

Here, we’ve used the astype function to change the data type.
Remember, when converting to integer types, make sure that your column doesn’t have any missing values, because NaN values are treated as floats by pandas. If there are missing values, you may want to fill or drop them before changing the data type, as sketched below.
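
For example, here is a minimal sketch (with a made-up Series) showing both options: filling the missing values first, or using pandas’ nullable Int64 type, which can hold missing values directly:

import pandas as pd
import numpy as np

age = pd.Series([28.0, np.nan, 35.0])

# Option 1: fill missing values, then cast to a plain integer type
filled = age.fillna(age.mean()).astype('int64')
print(filled)

# Option 2: use the nullable integer type, which allows missing values
nullable = age.astype('Int64')
print(nullable)
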
Additionally, you can convert numeric data to categorical data types. This is especially useful in machine learning when you’re working with categorical variables.

df['Age'] = df['Age'].astype('category')
print(df.dtypes)

 

Aggregating Data

Let’s assume that our DataFrame has columns: ‘Gender’, ‘Occupation’, and ‘Salary’. We want to calculate the average and maximum salary, and the total number of individuals for each ‘Gender’ and ‘Occupation’. Here’s how we can do it:

grouped = df.groupby(['Gender', 'Occupation'])
agg_data = grouped['Salary'].agg(['mean', 'max', 'count'])
print(agg_data)

Output:

                        mean    max  count
Gender Occupation                        
Female Data Analyst    59000  65000     10
       Data Scientist  61000  67000     12
Male   Data Analyst    63000  69000     11
       Data Scientist  67000  72000     15

Here, the agg function is used to perform multiple aggregations at once. It takes a list of functions as input and applies each of them to the ‘Salary’ column of each group.
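
If you prefer custom column names for the aggregates, you can use named aggregation. A quick sketch with the same assumed columns:

agg_named = df.groupby(['Gender', 'Occupation']).agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Salary', 'count')
)
print(agg_named)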

 

Grouping Data

One of the powerful functionalities provided by pandas is grouping data based on certain criteria, similar to the “GROUP BY” statement in SQL.

This allows us to analyze subsets of the data independently.
Suppose we have a column ‘Gender’ in our DataFrame, and we’d like to group our data by this variable:

grouped = df.groupby('Gender')

Once the data is grouped, we can apply aggregation functions like mean, sum, max, etc. to these groups. For example, if you wanted to know the average salary of each gender, you could do:

mean_salary = grouped['Salary'].mean()
print(mean_salary)

Output:

Gender
Female    60000.0
Male      65000.0
Name: Salary, dtype: float64

Here, the groupby function is used to separate the data into groups based on ‘Gender’. Then, we calculate the mean salary for each group using the mean function.
We can also group by multiple columns. For example, if you wanted to group by both ‘Gender’ and ‘Occupation’, you would do:

grouped = df.groupby(['Gender', 'Occupation'])
mean_salary = grouped['Salary'].mean()
print(mean_salary)

Output:

Gender  Occupation   
Female  Data Analyst      59000.0
        Data Scientist    61000.0
Male    Data Analyst      63000.0
        Data Scientist    67000.0
Name: Salary, dtype: float64

 

Pivot Tables

Pivot tables are used to summarize, analyze, explore, and present summary statistics of your data. They are especially helpful when you have long data that needs to be reshaped for specific analyses.

Pandas provides a method ‘pivot_table’ to create MS Excel spreadsheet-style pivot tables.
For example, suppose we have a DataFrame df with columns ‘Gender’, ‘Occupation’, and ‘Salary’. If we wanted to create a pivot table to view the average salary by gender and occupation, we could do:

pivot = df.pivot_table(values='Salary', index='Occupation', columns='Gender', aggfunc='mean')
print(pivot)

Output:

Gender          Female     Male
Occupation                     
Data Analyst    59000.0  63000.0
Data Scientist  61000.0  67000.0

Here, the pivot_table function takes the following arguments:

  1. values: This is the column that you want to summarize. In this case, it’s the ‘Salary’ column.
  2. index: The column(s) that you want to set as the index of the output table. Here, it’s the ‘Occupation’ column.
  3. columns: The column(s) that you want to convert into columns in the output table. Here, it’s the ‘Gender’ column.
  4. aggfunc: The aggregation function that you want to apply. Here, we’ve used ‘mean’ to calculate the average salary.
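
pivot_table also accepts helpful extras such as fill_value (to replace missing combinations) and margins (to add row/column totals). A quick sketch with the same assumed columns:

pivot = df.pivot_table(
    values='Salary',
    index='Occupation',
    columns='Gender',
    aggfunc='mean',
    fill_value=0,   # replace missing Gender/Occupation combinations with 0
    margins=True    # add an 'All' row and column with overall means
)
print(pivot)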

 

Correlations

Pandas provides a method, corr, to calculate correlations between columns in a DataFrame.
Let’s say we have a DataFrame df that includes columns ‘Age’, ‘Salary’, and ‘Experience’. To find out the correlation between these columns, we can use:

correlation = df[['Age', 'Salary', 'Experience']].corr()
print(correlation)

Output:

             Age    Salary  Experience
Age        1.0000   0.7467     0.9801
Salary     0.7467   1.0000     0.7698
Experience 0.9801   0.7698     1.0000

This is a correlation matrix, where each cell in the table shows the correlation between two variables. A correlation of 1 indicates a perfect positive correlation, while -1 indicates a perfect negative correlation. A correlation of 0 means that no relationship exists.
We used the corr method, which by default calculates the Pearson correlation. Other correlation methods like Spearman and Kendall can also be specified using the method parameter.
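
For example, a Spearman correlation matrix for the same columns can be requested like this:

spearman_corr = df[['Age', 'Salary', 'Experience']].corr(method='spearman')
print(spearman_corr)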

 

Descriptive Statistics

Descriptive statistics summarize and organize characteristics of a dataset. Pandas provides several functions that make it easy to explore and understand the data at a high level.
Suppose we have a DataFrame with columns ‘Age’, ‘Salary’, and ‘Experience’. If we want to get some quick descriptive statistics for these columns, we can use the describe function:

desc_stats = df[['Age', 'Salary', 'Experience']].describe()
print(desc_stats)

Output:

             Age         Salary   Experience
count  100.000000    100.000000  100.000000
mean    29.710000  65000.000000    7.500000
std      6.982456   7055.473572    2.293098
min     22.000000  50000.000000    3.000000
25%     25.000000  60000.000000    6.000000
50%     28.000000  65000.000000    7.000000
75%     32.000000  70000.000000    9.000000
max     45.000000  80000.000000   13.000000

Here, count is the number of non-NA/null values, mean is the mean of the values, std is the standard deviation, min and max are the minimum and maximum values respectively. 25%, 50%, and 75% are the lower quartile, median, and upper quartile respectively.
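
You can also compute individual statistics directly on a column; a quick sketch using the same assumed columns:

print(df['Age'].mean())        # average age
print(df['Salary'].median())   # median salary
print(df['Experience'].std())  # standard deviation of experience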

 

Merging DataFrames

You can combine DataFrames in Pandas using the merge function. Let’s consider two sample DataFrames:

import pandas as pd

df1 = pd.DataFrame({
    'EmployeeID': ['1', '2', '3', '4'],
    'Name': ['Amy', 'Bob', 'Charlie', 'Dave'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing']
})

df2 = pd.DataFrame({
    'EmployeeID': ['3', '4', '5', '6'],
    'Age': [25, 30, 35, 40],
    'Salary': [70000, 80000, 90000, 100000]
})

If you want to merge these DataFrames based on the ‘EmployeeID’, you can do it like this:

df = pd.merge(df1, df2, on='EmployeeID')
print(df)

Output:

  EmployeeID     Name Department  Age  Salary
0          3  Charlie         IT   25   70000
1          4     Dave  Marketing   30   80000

Here, Pandas matched up the ‘EmployeeID’ values across the two DataFrames and joined them together. By default, merge performs an inner join, which means that only the keys that are present in both DataFrames are combined together.

If you want to include all the keys from both DataFrames, you can specify how='outer':

df = pd.merge(df1, df2, on='EmployeeID', how='outer')
print(df)

Output:

  EmployeeID     Name Department   Age   Salary
0          1      Amy         HR   NaN      NaN
1          2      Bob      Sales   NaN      NaN
2          3  Charlie         IT  25.0  70000.0
3          4     Dave  Marketing  30.0  80000.0
4          5      NaN        NaN  35.0  90000.0
5          6      NaN        NaN  40.0 100000.0
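
Other join types are available through the how parameter, such as 'left', 'right', and 'cross'. For instance, a left join keeps every row from df1 (a quick sketch):

df_left = pd.merge(df1, df2, on='EmployeeID', how='left')
print(df_left)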

 

Concatenating DataFrames

Concatenation is another way to combine DataFrames in pandas. This is typically used when you want to append rows or columns from one DataFrame to another. Let’s consider two sample DataFrames:

df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']},
    index=[0, 1, 2, 3])

df2 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7'],
    'C': ['C4', 'C5', 'C6', 'C7'],
    'D': ['D4', 'D5', 'D6', 'D7']},
    index=[4, 5, 6, 7])

If you want to concatenate these DataFrames along the row axis (i.e., append df2 as new rows to df1), you can do:

df = pd.concat([df1, df2])
print(df)

Output:

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

By default, concat joins along axis=0 (i.e., it appends as rows). If you want to concatenate along the column axis (i.e., append df2 as new columns to df1), you can specify axis=1:

df = pd.concat([df1, df2], axis=1)

Concatenating DataFrames is often useful in data wrangling where you have data in multiple pieces and you want to combine them into a single DataFrame for further analysis.

 

Reshaping DataFrames

You can use the melt and pivot functions to reshape your DataFrames.

Melt

You can use the melt function to transform or reshape data into a format where one or more columns are identifier variables, while all other columns are considered measured variables.
Let’s consider a sample DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-06-01', '2023-06-01', '2023-06-02', '2023-06-02'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Temperature': [75, 77, 76, 78],
    'Humidity': [80, 70, 78, 85]
})
print(df)

Output:

         Date         City  Temperature  Humidity
0  2023-06-01     New York           75        80
1  2023-06-01  Los Angeles           77        70
2  2023-06-02     New York           76        78
3  2023-06-02  Los Angeles           78        85

If you want to melt this DataFrame, you can do:

melted = df.melt(id_vars=['Date', 'City'], var_name='Measurement', value_name='Value')
print(melted)

Output:

         Date         City  Measurement  Value
0  2023-06-01     New York  Temperature     75
1  2023-06-01  Los Angeles  Temperature     77
2  2023-06-02     New York  Temperature     76
3  2023-06-02  Los Angeles  Temperature     78
4  2023-06-01     New York     Humidity     80
5  2023-06-01  Los Angeles     Humidity     70
6  2023-06-02     New York     Humidity     78
7  2023-06-02  Los Angeles     Humidity     85

 

Pivot

The pivot function is the inverse of melt. It reshapes data by turning the unique values from one column into separate columns and filling the corresponding cells with the matching values.
Continuing with the melted DataFrame from the previous example, if you want to pivot it back into its original form, you can do it like this:

pivoted = melted.pivot(index=['Date', 'City'], columns='Measurement', values='Value')
print(pivoted.reset_index())

Output:

Measurement        Date         City  Humidity  Temperature
0            2023-06-01  Los Angeles        70           77
1            2023-06-01     New York        80           75
2            2023-06-02  Los Angeles        85           78
3            2023-06-02     New York        78           76

 

Column Operations

You can perform several operations on columns such as selecting, renaming, creating, and deleting columns. Let’s consider a sample DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Amy', 'Bob', 'Charlie', 'Dave'],
    'Age': [23, 25, 22, 21],
    'Salary': [70000, 80000, 90000, 85000]
})

Selecting Columns

You can select a single column by using bracket notation and the column’s name:

names = df['Name']
print(names)

Output:

0        Amy
1        Bob
2    Charlie
3       Dave
Name: Name, dtype: object
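
To select multiple columns, pass a list of column names (a quick sketch):

subset = df[['Name', 'Salary']]
print(subset)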

Renaming Columns

To rename a column, you can use the rename function:

df = df.rename(columns={'Name': 'Employee Name'})
print(df)

Creating Columns

To create a new column, you can define it in terms of existing columns:

df['Age Next Year'] = df['Age'] + 1
print(df)

Deleting Columns

To delete a column, you can use the drop function:

df = df.drop(columns='Age Next Year')
print(df)

 

Date and Time Handling

To start, let’s create a sample DataFrame with date strings:

import pandas as pd
df = pd.DataFrame({
    'Name': ['Amy', 'Bob', 'Charlie', 'Dave'],
    'DOB': ['1995-05-01', '1996-07-23', '1997-08-15', '1998-01-10']
})

Converting Strings to Datetime

Often, dates come as strings, and it would be beneficial to convert these into datetime objects to perform date-related operations:

df['DOB'] = pd.to_datetime(df['DOB'])
print(df['DOB'].dtype)  # Output: datetime64[ns]

Extracting Date Components

Once we have a datetime column, we can extract various components like the year, month, day, etc.:

df['Year'] = df['DOB'].dt.year
print(df)
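
Similarly, you can pull out the month and day components (a quick sketch using the same DOB column):

df['Month'] = df['DOB'].dt.month
df['Day'] = df['DOB'].dt.day
print(df)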

Time Delta

Time delta represents the duration or the difference in time. Let’s assume you have data with timestamps and you want to calculate the time difference between events:

import pandas as pd
time_data = pd.Series(pd.date_range('2023-01-01', periods=3, freq='D'))
print(time_data)

Output:

0 2023-01-01
1 2023-01-02
2 2023-01-03
dtype: datetime64[ns]

To get the time difference (Timedelta) between these dates, you can use the diff() function:

time_delta = time_data.diff()
print(time_delta)

Output:

0 NaT
1 1 days
2 1 days
dtype: timedelta64[ns]

Here, the diff() function calculates the difference between the current and a prior element in the Series.

As there is no prior element for the first timestamp, the result is NaT (Not a Time), which is the pandas representation for a missing or null time data.

Changing Date Format

You can also change the display format of your dates using the strftime function (note that this converts the datetime column back to strings):

df['DOB'] = df['DOB'].dt.strftime('%d/%m/%Y')
print(df)

 

Data Visualization

Pandas provides built-in data visualization functions, so you can plot your data in many ways depending on your needs.

Line Plot

Line charts are among the most frequently used chart types in data visualization. They are especially useful for visualizing changes in values over time.
You can use the DataFrame.plot.line() function to generate line plots in pandas. Let’s consider a DataFrame with monthly sales data:

import pandas as pd
df = pd.DataFrame({
    'Month': pd.date_range(start='01-01-2023', periods=12, freq='M'),
    'Sales': [15, 18, 20, 22, 25, 23, 30, 33, 40, 45, 50, 55]
})
df.set_index('Month', inplace=True)

Now, you can plot the sales data:

df.plot.line(title='Monthly Sales')

This will produce a line chart with the months on the x-axis and the sales values on the y-axis.

Pandas line plot

Remember, if you’re using a Jupyter notebook, you don’t need to do anything special to view the plot, but in other environments you should use Matplotlib’s plt.show() to display the plot; otherwise, it won’t be visible.

import matplotlib.pyplot as plt
df.plot.line(title='Monthly Sales')
plt.show()

 

Bar Plot

You can use the DataFrame.plot.bar() function to create bar charts. Let’s consider a DataFrame with sales data across different categories:

import pandas as pd
df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Home & Kitchen', 'Health & Beauty'],
    'Sales': [15000, 12000, 18000, 20000]
})
df.set_index('Category', inplace=True)

Now, to generate a bar chart:

df.plot.bar(title='Sales by Category', color='blue', alpha=0.75)

Pandas bar plot

In the resulting bar chart, each category (Electronics, Clothing, Home & Kitchen, Health & Beauty) has a bar associated with it, and the length of the bar corresponds to the amount of sales in that category.

We used the color argument to set the color of the bars, and the alpha argument sets the transparency of the bars. The title of the chart is set using the title argument.

 

Scatter Plot

Scatter plots are used to display the relationship between two numerical variables. With pandas, you can use the DataFrame.plot.scatter() function to create scatter plots. Let’s consider a DataFrame with age and salary data:

import pandas as pd
df = pd.DataFrame({
    'Age': [22, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
})

To generate a scatter plot to examine the relationship between age and salary:

df.plot.scatter(x='Age', y='Salary', title='Age vs Salary')

Pandas scatter plot

In the scatter plot, each point represents a single individual’s age (x-axis) and salary (y-axis). If there’s a clear pattern in the distribution of points, it suggests a relationship between the two variables.

 

Histogram

A histogram is a graphical representation that organizes a group of data points into a specified range.

In Pandas, you can use the DataFrame.plot.hist() function to create histograms. Let’s consider a DataFrame with age data:

import pandas as pd
df = pd.DataFrame({
    'Age': [22, 25, 22, 35, 40, 35, 22, 40, 35, 22, 30, 25, 30, 25, 40, 30, 35, 40]
})

To generate a histogram to visualize the distribution of ages:

df['Age'].plot.hist(bins=6, alpha=0.5, title='Age Distribution')

Pandas Histogram

In this histogram, the ages (x-axis) are grouped into bins (ranges of ages), and the number of occurrences of each range is shown on the y-axis. The bins argument sets the number of bins in the histogram. The alpha argument sets the transparency of the bars.
In this case, the histogram makes it easy to see which age ranges occur most frequently in the data.

 

Box Plot

You can use the DataFrame.plot.box() function to create box plots. Let’s consider a DataFrame with salary data:

import pandas as pd
df = pd.DataFrame({
    'Salary': [35000, 40000, 36000, 41000, 39000, 38000, 37000, 42000, 43000, 45000, 44000, 50000, 55000]
})

Now, to generate a box plot of the salary data:

df['Salary'].plot.box(title='Salary Distribution')

Pandas box plot

In this box plot, the box represents the interquartile range (IQR), i.e., 25th to 75th percentile of the data, the line inside the box is the median (50th percentile), and the whiskers typically represent the range for the rest of the data.

Any points outside of the whiskers can be considered outliers.

 

Pandas comes to the rescue

I was approached by a large US retail company for a rather complex data analysis project.

They were dealing with sales data spread across multiple Excel and CSV files, each containing hundreds of thousands of rows and more than 40 columns of data.

The data files were located on their cloud storage and were updated daily. My task was to create a system that could efficiently handle this data, perform numerous transformations, and churn out insightful reports daily.

At first, I took a stab at the problem the old-fashioned way: I wrote some Python code that utilized the built-in csv library to read and parse the data.

I quickly ran into issues: the sheer amount of data was too large for efficient processing in regular Python.

Even simple tasks, like calculating averages or filtering data based on conditions, became painfully slow.

It was taking roughly 2 hours to process a single file, meaning a full day’s worth of files could potentially take over 48 hours! I quickly realized that this approach was unworkable.

At this moment, I turned to Pandas. A mighty tool indeed. Using it, I could read files directly from the cloud storage, perform necessary transformations on the entire dataset at once, and handle missing or wrong data, all while enjoying the benefits of its highly optimized performance.

With Pandas, reading in a single file took a mere 2 minutes, down from the initial 2 hours – that’s a 60x speed improvement!

The average time required to process a file, including all the transformations and calculations, was around 15 minutes.

That meant I could process an entire day’s worth of files in less than 12 hours, rather than the originally projected 48 hours – slashing the total time by 75%.

Furthermore, the code was far cleaner and more maintainable using Pandas. Because it allowed operations on entire data structures at once, the number of lines in the code was reduced by approximately 65%.

Another significant advantage of using Pandas was its robust handling of missing or incorrect data. Other methods like Python’s csv would’ve required complex and time-consuming error handling for each individual operation. With Pandas, this was handled almost automatically, saving an additional estimated 20% in overall development time.

My time spent on the project decreased by nearly 80%, and the client was delighted with the efficiency and accuracy of the daily reports.

Ultimately, it was a stark reminder: in the realm of data science, the right tools can make all the difference.

