20 Ways to Filter a Pandas DataFrame

Python’s Pandas is a powerful library for data manipulation and analysis. One of its primary features is the ability to filter data from a Pandas DataFrame. There are many different ways to filter data in Pandas.

It enables you to inspect specific rows or a subset of data, filter a Pandas DataFrame by one column or by several, or even filter it based on a single condition or a list of conditions.

In this tutorial, we’ll explore a variety of methods to filter a Pandas DataFrame.

 

 

Import Pandas

To start with, we need to import the Pandas package. This is usually done with the following command:

import pandas as pd

 

Create a sample DataFrame

We can create a simple DataFrame for this tutorial:

data = {
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
}

df = pd.DataFrame(data)

# Display the first five rows of the dataframe
print(df.head())

This will return the following output:

      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
4      Kim   32    Berlin      F

In the above code, we created a DataFrame from a Python dictionary and printed only the first five rows using the head method.

 

Filter Rows Based on a Single Column Value

You can filter Pandas DataFrame rows based on a specific value in a column. For example, to filter rows where ‘Gender’ is ‘M’:

df_filtered = df[df['Gender'] == 'M']
print(df_filtered)

This will return:

      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
7     Jack   25  New York      M

 

Filter Rows Using Multiple Column Values

To filter rows based on multiple column values, you can use multiple conditions:

df_filtered = df[(df['Gender'] == 'M') & (df['Age'] > 30)]
print(df_filtered)

The result will be:

   Name  Age     City Gender
1  John   34   London      M

Here we filter the DataFrame to return only the rows where Gender is “M” and Age is greater than 30.

 

Filter Rows Using Logical Operators (AND, OR, NOT)

Pandas allows the use of logical operators (& for AND, | for OR, ~ for NOT) to filter a DataFrame. Here, we are filtering rows where ‘Gender’ is ‘M’ OR ‘City’ is ‘New York’:

df_filtered = df[(df['Gender'] == 'M') | (df['City'] == 'New York')]
print(df_filtered)

This will return:

      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
7     Jack   25  New York      M
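All three operators can be mixed in one expression, as long as each condition is wrapped in its own parentheses. A small sketch against the same sample data, selecting rows that are either not male or under 26:

```python
import pandas as pd

# Recreate the sample DataFrame used throughout this tutorial
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
})

# NOT (Gender == 'M') OR (Age < 26): all women, plus Alex (23) and Jack (25)
df_filtered = df[~(df['Gender'] == 'M') | (df['Age'] < 26)]
print(df_filtered['Name'].tolist())  # ['Alex', 'Kim', 'Lily', 'Emma', 'Jack']
```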

 

Filter Rows Based on Null or Missing Values

You can filter DataFrame rows based on null or missing values. In Pandas, NaN (Not a Number) represents a missing value. The isnull function builds a Boolean mask marking the rows where a value is missing:

df_filtered = df['Age'].isnull()
print(df_filtered)

This returns a Boolean Series with True for rows that have a missing value and False otherwise. Since our sample DataFrame has no missing values, every entry is False:

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: Age, dtype: bool
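To actually select those rows rather than just inspect the mask, pass the Boolean Series back into the DataFrame. A quick sketch, using a small hypothetical DataFrame that (unlike our sample) really does contain a missing Age:

```python
import pandas as pd
import numpy as np

# Hypothetical data with one missing Age value
df_nan = pd.DataFrame({
    'Name': ['Alex', 'John', 'Kim'],
    'Age': [23, np.nan, 32]
})

# Rows where Age is missing
missing = df_nan[df_nan['Age'].isnull()]
print(missing['Name'].tolist())  # ['John']

# The inverse: rows where Age is present
present = df_nan[df_nan['Age'].notnull()]
print(present['Name'].tolist())  # ['Alex', 'Kim']
```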

 

Filter Rows Using the isin Method

The isin method filters rows based on a list of values. It returns a Boolean Series that is True where the column value appears in the given list, and that Series is then used to select only the matching rows:

df_filtered = df[df['City'].isin(['London', 'Berlin'])]
print(df_filtered)

This will give us:

      Name  Age    City Gender
1     John   34  London      M
2  Richard   30  London      M
4      Kim   32  Berlin      F
5     Lily   26  Berlin      F
6     Emma   29  London      F

 

Filter Rows Based on a Regular Expression (Regex)

Pandas allows filtering of DataFrame rows using regular expressions. The str.contains function can be used with a regex pattern by setting the regex parameter to True:

df_filtered = df[df['Name'].str.contains('^A', regex=True)]
print(df_filtered)

Here is the resulting DataFrame:

   Name  Age      City Gender
0  Alex   23  New York      M
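str.contains takes a few keyword arguments worth knowing: case=False makes the match case-insensitive, and the pattern can use any regex feature, such as alternation. A sketch against the same sample data:

```python
import pandas as pd

# Recreate the sample DataFrame used throughout this tutorial
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
})

# Case-insensitive: names starting with 'a' or 'A'
starts_a = df[df['Name'].str.contains('^a', case=False, regex=True)]
print(starts_a['Name'].tolist())  # ['Alex']

# Alternation: rows whose City contains 'York' or 'Paris'
york_or_paris = df[df['City'].str.contains('York|Paris', regex=True)]
print(york_or_paris['Name'].tolist())  # ['Alex', 'Tom', 'Jack']
```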

 

Filter Rows Using the loc Function

The loc function selects rows by their index labels rather than by position. It also accepts Boolean arrays, so it can be combined with conditions and logical operators:

df_filtered = df.loc[df['Age'] > 30]
print(df_filtered)

This results in:

   Name  Age    City Gender
1  John   34  London      M
4   Kim   32  Berlin      F

Here we only return rows with ages greater than 30.

 

Filter Rows Using the iloc Function

The iloc function is used to filter rows based on position:

df_filtered = df.iloc[0:5]
print(df_filtered)

This will return the first five rows of the DataFrame:

      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
4      Kim   32    Berlin      F

The above code returns the rows at positions 0 through 4. The iloc indexer is zero-based, and the end of the slice is exclusive, so 0:5 selects the first five rows.
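iloc also accepts a second argument for column positions, so you can slice rows and columns by position in a single call; a short sketch on the same data:

```python
import pandas as pd

# Recreate the sample DataFrame used throughout this tutorial
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
})

# First three rows, and only the first and third columns (Name and City)
subset = df.iloc[0:3, [0, 2]]
print(subset)
```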

 

Filter Rows Using the Tilde (~) Operator

The tilde (~) operator is used to filter rows that do not match a condition:

df_filtered = df[~(df['City'] == 'New York')]
print(df_filtered)

The resulting DataFrame will be:

      Name  Age    City Gender
1     John   34  London      M
2  Richard   30  London      M
3      Tom   28   Paris      M
4      Kim   32  Berlin      F
5     Lily   26  Berlin      F
6     Emma   29  London      F

Here the code only returns rows where City is not ‘New York’.

 

Using Relational Operators (<, >, <=, >=, ==)

You can also filter rows using relational operators:

df_filtered = df[df['Age'] > 30]
print(df_filtered)

This will return:

   Name  Age    City Gender
1  John   34  London      M
4   Kim   32  Berlin      F

As you can see, it returns rows where age is greater than 30.

 

Filter Rows Based on a Range of Values

To filter rows within a certain range of values:

df_filtered = df[(df['Age'] >= 25) & (df['Age'] <= 30)]
print(df_filtered)

This will result in:

      Name  Age      City Gender
2  Richard   30    London      M
3      Tom   28     Paris      M
5     Lily   26    Berlin      F
6     Emma   29    London      F
7     Jack   25  New York      M

The above code returns rows where the age is greater than or equal to 25 and less than or equal to 30.
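The same range check can be written more compactly with the between method, which includes both endpoints by default:

```python
import pandas as pd

# Recreate the sample DataFrame used throughout this tutorial
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
})

# Ages from 25 to 30, endpoints included
df_between = df[df['Age'].between(25, 30)]
print(df_between['Name'].tolist())  # ['Richard', 'Tom', 'Lily', 'Emma', 'Jack']
```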

 

Filter Rows Based on Index Labels

Pandas DataFrame has an Index object, which allows us to quickly select rows based on index labels using the loc function:

df_filtered = df.loc[[0, 2, 4]]
print(df_filtered)

This will give us:

      Name  Age      City Gender
0     Alex   23  New York      M
2  Richard   30    London      M
4      Kim   32    Berlin      F

As you can see, we selected the rows with index labels 0, 2, and 4.

 

Filter Rows Using a Custom Function

You can also use a custom function to filter rows:

def filter_func(row):
    return row['Name'].startswith('A')

df_filtered = df[df.apply(filter_func, axis=1)]
print(df_filtered)

This will return:

   Name  Age      City Gender
0  Alex   23  New York      M

Here we defined a function and filtered rows where the Name starts with the letter “A”.
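Note that apply with axis=1 calls the Python function once per row, which gets slow on large DataFrames. When the condition only touches string columns, the vectorized str accessor usually expresses the same filter more efficiently; a sketch of the equivalent:

```python
import pandas as pd

# Recreate the sample DataFrame used throughout this tutorial
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
})

# Vectorized equivalent of the apply-based filter above
df_filtered = df[df['Name'].str.startswith('A')]
print(df_filtered['Name'].tolist())  # ['Alex']
```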

 

Filter Columns Using Column Names

You can filter columns using the column names:

df_filtered = df[['Name', 'Age']]
print(df_filtered)

The output will be:

      Name  Age
0     Alex   23
1     John   34
2  Richard   30
3      Tom   28
4      Kim   32
5     Lily   26
6     Emma   29
7     Jack   25

As you can see, only the two columns “Name” and “Age” are returned.

 

Filter Columns Using Regular Expressions (Regex)

You can use regular expressions to filter columns:

df_filtered = df.filter(regex='^N')
print(df_filtered)

This will return:

      Name
0     Alex
1     John
2  Richard
3      Tom
4      Kim
5     Lily
6     Emma
7     Jack

The above code filters columns where the column name starts with the letter “N”.
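When a plain substring match is enough, filter also takes a like parameter that keeps every column whose name contains the given substring, with no regex required:

```python
import pandas as pd

# Recreate the sample DataFrame used throughout this tutorial
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
})

# Keep columns whose name contains the substring 'Gen'
df_like = df.filter(like='Gen')
print(list(df_like.columns))  # ['Gender']
```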

 

Filter Data by Both Rows and Columns

You can filter both rows and columns using the loc function:

df_filtered = df.loc[df['Age'] > 30, ['Name', 'City']]
print(df_filtered)

This will return:

   Name    City
1  John  London
4   Kim  Berlin

The above code returns only the rows where Age is greater than 30, and only the “Name” and “City” columns.

 

Filter Rows and Columns Based on Specific Conditions

You can use specific conditions to filter rows and columns:

df_filtered = df.loc[df['Age'] > 30, df.columns.str.startswith('N')]
print(df_filtered)

This will return:

     Name
1    John
4     Kim

Here we have a mixed condition: return the rows where Age is greater than 30, and keep only the columns whose names start with “N” (just the Name column).

 

Filter DataFrame Rows by a Condition in Two Columns

You can filter DataFrame rows based on conditions in two columns:

df_filtered = df[(df['Age'] > 30) & (df['City'] == 'London')]
print(df_filtered)

This will return:

   Name  Age    City Gender
1  John   34  London      M

Here both conditions refer to columns: we get the rows where Age is greater than 30 and City is “London”.

 

Filter DataFrame Rows Using the query Method

You can use the query method to filter DataFrame rows by passing the conditions as a string like this:

df_filtered = df.query("Age > 30 and City == 'London'")
print(df_filtered)

This will return:

   Name  Age    City Gender
1  John   34  London      M
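query can also reference local Python variables by prefixing them with @, which keeps the condition string readable when thresholds come from elsewhere in your code:

```python
import pandas as pd

# Recreate the sample DataFrame used throughout this tutorial
df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
})

min_age = 30
target_city = 'London'

# Reference local variables inside the query string with @
df_filtered = df.query("Age > @min_age and City == @target_city")
print(df_filtered['Name'].tolist())  # ['John']
```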

 

How Filtering a Pandas DataFrame Saved Me

I have countless stories of how the power of Python, and specifically the Pandas package, has drastically reduced the time and effort required for data manipulation tasks.

But one story stands out. This story is about the time I was working on a colossal freelance project, which required sifting through a massive dataset and performing intricate data filtering.

The project involved analyzing a dataset containing historical sales data of an international e-commerce company.

The dataset contained more than 10 million rows, with dozens of columns representing various data points such as product ID, customer ID, sales region, product category, sale date, and many others. The objective was to perform a targeted sales analysis, which required extracting specific subsets of data based on multiple conditions.

Imagine manually going through millions of records and trying to extract a particular subset. It would be a nightmare! This is where filtering with Pandas DataFrame came to the rescue.

One part of the project required identifying customers from a specific region, say ‘North America’, who made purchases above a certain dollar amount, say $500, in the ‘Electronics’ category.

With the power of Pandas DataFrame filtering, I accomplished this with just a single line of code:

df_filtered = df[(df['Region'] == 'North America') & (df['Purchase Amount'] > 500) & (df['Category'] == 'Electronics')]

This line extracted roughly 50,000 transactions that matched the specified conditions – a mere 0.5% of the entire dataset, but incredibly valuable for the targeted analysis.

Another crucial part of the project involved time-series analysis. I needed to analyze the monthly sales trend of each product category. The problem was that the ‘Sale Date’ column was in string format and included both date and time, which was unnecessary for this analysis.

Here’s how I tackled this with Pandas:

df['Sale Date'] = pd.to_datetime(df['Sale Date']).dt.to_period('M')

And then, to filter data for a specific month:

df_filtered = df[df['Sale Date'] == '2023-01']

In another instance, I was asked to focus on the top 1% of customers based on their total purchase amount. Thanks to Pandas, this was child’s play:

top_customers = df.groupby('Customer ID')['Purchase Amount'].sum().nlargest(int(0.01*df['Customer ID'].nunique()))

Each of these filtering operations took just a few seconds to run.

 

Conclusion

This tutorial has covered various methods to filter data in a Pandas DataFrame. Filtering data is a key step in data analysis and preparation, and Pandas provides a powerful and flexible toolkit for this task.
