20 Ways to Filter Pandas DataFrame
Python’s Pandas is a powerful library for data manipulation and analysis, and one of its core features is the ability to filter data in a DataFrame. There are many ways to do it.
You can inspect specific rows or a subset of the data, filter a Pandas DataFrame by one column or multiple columns, or filter it based on a single condition or a list of conditions.
In this tutorial, we’ll explore the following methods to filter a Pandas DataFrame.
- 1 Import Pandas
- 2 Create a sample DataFrame
- 3 Filter Rows Based on a Single Column Value
- 4 Filter Rows Using Multiple Column Values
- 5 Filter Rows Using Logical Operators (AND, OR, NOT)
- 6 Filter Rows Based on Null or Missing Values
- 7 Filter Rows Using the isin Method
- 8 Filter Rows Based on a Regular Expression (Regex)
- 9 Filter Rows Using the loc Function
- 10 Filter Rows Using the iloc Function
- 11 Filter Rows Using the Tilde (~) Operator
- 12 Filter Rows Using Relational Operators (<, >, <=, >=, ==)
- 13 Filter Rows Based on a Range of Values
- 14 Filter Rows Based on Index Labels
- 15 Filter Rows Using a Custom Function
- 16 Filter Columns Using Column Names
- 17 Filter Columns Using Regular Expressions (Regex)
- 18 Filter Data by Both Rows and Columns
- 19 Filter Rows and Columns Based on Specific Conditions
- 20 Filter DataFrame Rows by a Condition in Two Columns
- 21 Filter DataFrame Rows Using the query Method
- 22 Filtering Pandas DataFrame saved me
- 23 Conclusion
Import Pandas
To start with, we need to import the Pandas package. This is usually done with the following command:
import pandas as pd
Create a sample DataFrame
We can create a simple DataFrame for this tutorial:
data = {
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M']
}
df = pd.DataFrame(data)

# Display the first five rows of the dataframe
print(df.head())
This will return the following output:
      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
4      Kim   32    Berlin      F
In the above code, we created a DataFrame from a Python dictionary and printed only the first five rows using the head method.
Filter Rows Based on a Single Column Value
You can filter Pandas DataFrame rows based on a specific value in a column. For example, to filter rows where ‘Gender’ is ‘M’:
df_filtered = df[df['Gender'] == 'M']
print(df_filtered)
This will return:
      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
7     Jack   25  New York      M
Filter Rows Using Multiple Column Values
To filter rows based on multiple column values, you can use multiple conditions:
df_filtered = df[(df['Gender'] == 'M') & (df['Age'] > 30)]
print(df_filtered)
The result will be:
   Name  Age    City Gender
1  John   34  London      M
Here we filter the DataFrame and return only rows where Gender is “M” and Age is greater than 30.
Filter Rows Using Logical Operators (AND, OR, NOT)
Pandas allows the use of logical operators (& for AND, | for OR, ~ for NOT) to filter a DataFrame. Here, we are filtering rows where ‘Gender’ is ‘M’ OR ‘City’ is ‘New York’:
df_filtered = df[(df['Gender'] == 'M') | (df['City'] == 'New York')]
print(df_filtered)
This will return:
      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
7     Jack   25  New York      M
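The three operators can also be combined in one expression. Note that the parentheses around each comparison are required, because & and | bind more tightly than ==. A quick sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard', 'Tom', 'Kim', 'Lily', 'Emma', 'Jack'],
    'Age': [23, 34, 30, 28, 32, 26, 29, 25],
    'City': ['New York', 'London', 'London', 'Paris', 'Berlin', 'Berlin', 'London', 'New York'],
    'Gender': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'M'],
})

# AND combined with NOT: males who are NOT in London
df_filtered = df[(df['Gender'] == 'M') & ~(df['City'] == 'London')]
```

This keeps Alex, Tom, and Jack, the male rows whose city is not London.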
Filter Rows Based on Null or Missing Values
You can filter DataFrame rows based on null or missing values. In Pandas, NaN (Not a Number) represents missing values. The isnull function returns a Boolean mask marking the rows that contain missing values:
df_filtered = df['Age'].isnull()
print(df_filtered)
This returns a Series with ‘True’ for rows with missing values and ‘False’ for rows without missing values:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: Age, dtype: bool
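Since our sample DataFrame has no missing values, the mask above is all False. To actually filter rows, wrap the mask in the indexing brackets. Here is a small sketch using a copy of the data with one Age deliberately set to NaN, also showing notnull, the complement of isnull:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard'],
    'Age': [23, np.nan, 30],
    'City': ['New York', 'London', 'London'],
})

# Rows where Age IS missing
missing = df_demo[df_demo['Age'].isnull()]

# Rows where Age is NOT missing (the complement)
present = df_demo[df_demo['Age'].notnull()]
```

Here `missing` contains only John’s row, and `present` contains the other two.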
Filter Rows Using the isin Method
The isin method is used to filter rows based on a list of values. It returns a Boolean Series that is True where the column value is in the given list, and that Series is then used to select only the matching rows:
df_filtered = df[df['City'].isin(['London', 'Berlin'])]
print(df_filtered)
This will give us:
      Name  Age    City Gender
1     John   34  London      M
2  Richard   30  London      M
4      Kim   32  Berlin      F
5     Lily   26  Berlin      F
6     Emma   29  London      F
Filter Rows Based on a Regular Expression (Regex)
Pandas allows filtering of DataFrame rows using regular expressions. The str.contains function accepts a regex pattern when its regex parameter is set to True (which is also the default):
df_filtered = df[df['Name'].str.contains('^A', regex=True)]
print(df_filtered)
Here is the resulting DataFrame:
   Name  Age      City Gender
0  Alex   23  New York      M
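Two str.contains parameters are worth knowing for real-world data: case=False makes the match case-insensitive, and na=False treats missing values as non-matches instead of propagating NaN into the mask. A small sketch on a standalone Series:

```python
import pandas as pd

names = pd.Series(['Alex', 'alexandra', 'John', None])

# case=False: case-insensitive regex match
# na=False:   missing values count as "no match" rather than NaN
mask = names.str.contains('^a', case=False, regex=True, na=False)
```

Here the mask is True for 'Alex' and 'alexandra' and False for 'John' and the missing value.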
Filter Rows Using the loc Function
The loc function selects rows and columns by label. It also accepts Boolean arrays, so it can be used with logical conditions:
df_filtered = df.loc[df['Age'] > 30]
print(df_filtered)
This results in:
   Name  Age    City Gender
1  John   34  London      M
4   Kim   32  Berlin      F
Here we only return rows with ages greater than 30.
Filter Rows Using the iloc Function
The iloc function is used to filter rows based on integer position:
df_filtered = df.iloc[0:5]
print(df_filtered)
This will return the first five rows of the DataFrame:
      Name  Age      City Gender
0     Alex   23  New York      M
1     John   34    London      M
2  Richard   30    London      M
3      Tom   28     Paris      M
4      Kim   32    Berlin      F
The above code returns the rows at positions 0 through 4; the slice end (5) is exclusive. iloc is zero-based, so it starts counting from position zero.
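iloc also accepts a second positional argument for columns, so you can slice rows and columns by position at the same time. A small sketch (using a trimmed copy of the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Richard'],
    'Age': [23, 34, 30],
    'City': ['New York', 'London', 'London'],
})

# First two rows and first two columns; both slice ends are exclusive
subset = df.iloc[0:2, 0:2]
```

`subset` contains only the Name and Age columns for the first two rows.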
Filter Rows Using the Tilde (~) Operator
The tilde (~) operator is used to filter rows that do not match a condition:
df_filtered = df[~(df['City'] == 'New York')]
print(df_filtered)
The resulting DataFrame will be:
      Name  Age    City Gender
1     John   34  London      M
2  Richard   30  London      M
3      Tom   28   Paris      M
4      Kim   32  Berlin      F
5     Lily   26  Berlin      F
6     Emma   29  London      F
Here the code only returns rows where City is not ‘New York’.
Filter Rows Using Relational Operators (<, >, <=, >=, ==)
You can also filter rows using relational operators:
df_filtered = df[df['Age'] > 30]
print(df_filtered)
This will return:
   Name  Age    City Gender
1  John   34  London      M
4   Kim   32  Berlin      F
As you can see, it returns rows where Age is greater than 30.
Filter Rows Based on a Range of Values
To filter rows within a certain range of values:
df_filtered = df[(df['Age'] >= 25) & (df['Age'] <= 30)]
print(df_filtered)
This will result in:
      Name  Age      City Gender
2  Richard   30    London      M
3      Tom   28     Paris      M
5     Lily   26    Berlin      F
6     Emma   29    London      F
7     Jack   25  New York      M
The above code returns rows where Age is greater than or equal to 25 and less than or equal to 30.
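The same range filter can be written more compactly with the between method, which is inclusive of both bounds by default. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'John', 'Tom'], 'Age': [23, 34, 28]})

# between() includes both endpoints by default
df_filtered = df[df['Age'].between(25, 30)]
```

Here only Tom (28) falls inside the 25–30 range.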
Filter Rows Based on Index Labels
A Pandas DataFrame has an Index object, which allows us to quickly select rows by index label using the loc function:
df_filtered = df.loc[[0, 2, 4]]
print(df_filtered)
This will give us:
      Name  Age      City Gender
0     Alex   23  New York      M
2  Richard   30    London      M
4      Kim   32    Berlin      F
As you can see, we selected the rows with index labels 0, 2, and 4.
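loc also accepts label slices. Unlike iloc’s position slices, a label slice includes BOTH endpoints, which is a frequent source of off-by-one surprises. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Age': [23, 34, 30, 28, 32]})

# A label slice with loc includes both endpoints (labels 1, 2 AND 3)
sliced = df.loc[1:3]
```

With the default integer index, this returns three rows, not two.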
Filter Rows Using a Custom Function
You can also use a custom function to filter rows:
def filter_func(row):
    return row['Name'].startswith('A')

df_filtered = df[df.apply(filter_func, axis=1)]
print(df_filtered)
This will return:
   Name  Age      City Gender
0  Alex   23  New York      M
Here we defined a function and filtered rows where the Name starts with the letter “A”.
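For a one-off condition, the same filter can be written inline with a lambda. And when the condition touches only one column, a vectorized string method avoids the row-by-row apply entirely and is usually faster. A sketch showing both variants producing the same result:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'John', 'Anna'], 'Age': [23, 34, 30]})

# Inline lambda instead of a named function
df_lambda = df[df.apply(lambda row: row['Name'].startswith('A'), axis=1)]

# Vectorized equivalent for a single-column condition
df_vectorized = df[df['Name'].str.startswith('A')]
```

Both return the rows for Alex and Anna.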
Filter Columns Using Column Names
You can filter columns using the column names:
df_filtered = df[['Name', 'Age']]
print(df_filtered)
The output will be:
      Name  Age
0     Alex   23
1     John   34
2  Richard   30
3      Tom   28
4      Kim   32
5     Lily   26
6     Emma   29
7     Jack   25
As you can see, we return only the two columns “Name” and “Age”.
Filter Columns Using Regular Expressions (Regex)
You can use regular expressions to filter columns:
df_filtered = df.filter(regex='^N')
print(df_filtered)
This will return:
      Name
0     Alex
1     John
2  Richard
3      Tom
4      Kim
5     Lily
6     Emma
7     Jack
The above code filters columns where the column name starts with the letter “N”.
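Besides regex, the filter method has a like parameter for simple substring matching and an items parameter for exact column names, which are often easier to read. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex'], 'Age': [23], 'City': ['New York']})

# like= keeps columns whose name CONTAINS the substring
by_substring = df.filter(like='am')          # matches 'Name'

# items= keeps exactly the listed columns
by_items = df.filter(items=['Age', 'City'])
```

These three parameters (items, like, regex) are mutually exclusive; pass only one per call.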
Filter Data by Both Rows and Columns
You can filter both rows and columns using the loc function:
df.loc[df['Age'] > 30, ['Name', 'City']]
This will return:
   Name    City
1  John  London
4   Kim  Berlin
The above code returns only the rows where Age is greater than 30, and only the “Name” and “City” columns.
Filter Rows and Columns Based on Specific Conditions
You can use specific conditions to filter rows and columns:
df_filtered = df.loc[df['Age'] > 30, df.columns.str.startswith('N')]
print(df_filtered)
This will return:
   Name
1  John
4   Kim
Here we mix a row condition with a column condition: return rows where Age is greater than 30, and keep only the columns whose names start with “N”.
Filter DataFrame Rows by a Condition in Two Columns
You can filter DataFrame rows based on conditions in two columns:
df_filtered = df[(df['Age'] > 30) & (df['City'] == 'London')]
print(df_filtered)
This will return:
   Name  Age    City Gender
1  John   34  London      M
Here each condition applies to a different column: we get rows where Age is greater than 30 and City is “London”.
Filter DataFrame Rows Using the query Method
You can use the query method to filter DataFrame rows by passing the conditions as a string, like this:
df_filtered = df.query("Age > 30 and City == 'London'")
print(df_filtered)
This will return:
   Name  Age    City Gender
1  John   34  London      M
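A handy feature of query is the @ prefix, which lets the query string reference local Python variables, so thresholds and lists don’t have to be hard-coded into the string. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alex', 'John', 'Kim'],
    'Age': [23, 34, 32],
    'City': ['New York', 'London', 'Berlin'],
})

min_age = 30
cities = ['London', 'Berlin']

# @ pulls variables from the surrounding Python scope into the query
df_filtered = df.query("Age > @min_age and City in @cities")
```

This returns the rows for John and Kim, the only people over 30 in the listed cities.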
Filtering Pandas DataFrame saved me
I have countless stories of how the power of Python, and specifically the Pandas package, has drastically reduced the time and effort required for data manipulation tasks.
But one story stands out. This story is about the time I was working on a colossal freelance project, which required sifting through a massive dataset and performing intricate data filtering.
The project involved analyzing a dataset containing historical sales data of an international e-commerce company.
The dataset contained more than 10 million rows, with dozens of columns representing various data points such as product ID, customer ID, sales region, product category, sale date, and many others. The objective was to perform a targeted sales analysis, which required extracting specific subsets of data based on multiple conditions.
Imagine manually going through millions of records and trying to extract a particular subset. It would be a nightmare! This is where filtering with Pandas DataFrame came to the rescue.
One part of the project required identifying customers from a specific region, say ‘North America’, who made purchases above a certain dollar amount, say $500, in the ‘Electronics’ category.
With the power of Pandas DataFrame filtering, I accomplished this with just a single line of code:
df_filtered = df[(df['Region'] == 'North America') & (df['Purchase Amount'] > 500) & (df['Category'] == 'Electronics')]
This one line extracted roughly 50,000 transactions matching the specified conditions – a mere 0.5% of the entire dataset, but incredibly valuable for the targeted analysis.
Another crucial part of the project involved time-series analysis. I needed to analyze the monthly sales trend of each product category. The problem was that the ‘Sale Date’ column was in string format and included both date and time, which was unnecessary for this analysis.
Here’s how I tackled this with Pandas:
df['Sale Date'] = pd.to_datetime(df['Sale Date']).dt.to_period('M')
And then, to filter data for a specific month:
df_filtered = df[df['Sale Date'] == '2023-01']
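The monthly trend described above could then be computed by grouping the filtered rows by category. The sketch below uses a toy stand-in for the sales data; the column names follow the story and are assumptions, not the actual dataset:

```python
import pandas as pd

# Toy stand-in for the sales dataset described in the story
df = pd.DataFrame({
    'Sale Date': ['2023-01-05 10:00', '2023-01-20 14:30', '2023-02-03 09:15'],
    'Category': ['Electronics', 'Electronics', 'Books'],
    'Purchase Amount': [600.0, 250.0, 40.0],
})

# Convert the string column to monthly periods, as in the story
df['Sale Date'] = pd.to_datetime(df['Sale Date']).dt.to_period('M')

# Filter one month, then total sales per category for that month
jan = df[df['Sale Date'] == '2023-01']
monthly_totals = jan.groupby('Category')['Purchase Amount'].sum()
```

Repeating this per period (or grouping on both 'Sale Date' and 'Category' at once) yields the full month-by-month trend.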
In another instance, I was asked to focus on the top 1% of customers based on their total purchase amount. Thanks to Pandas, this seemed like child’s play:
top_customers = df.groupby('Customer ID')['Purchase Amount'].sum().nlargest(int(0.01*df['Customer ID'].nunique()))
Each of these filtering operations took just a few seconds to run.
Conclusion
This tutorial has covered various methods to filter data in a Pandas DataFrame. Filtering data is a key step in data analysis and preparation, and Pandas provides a powerful and flexible toolkit for this task.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.