Pandas query() vs filter(): Which Method You Should Use?
Two common methods that you’ll often use in Pandas are query()
and filter()
.
The main difference: The query()
method is used mainly to filter rows using string expressions while filter
method is used mainly for column selection.
In this tutorial, you’ll understand the differences between them and when to use which.
Filter Rows
Using query Method
Here’s a basic example:
import pandas as pd data = {'user_id': [1, 2, 3, 4], 'age': [24, 30, 22, 26], 'plan_type': ['Basic', 'Premium', 'Basic', 'Premium']} df = pd.DataFrame(data) result = df.query('age > 25') print(result)
Output:
user_id age plan_type 1 2 30 Premium 3 4 26 Premium
Here, you filtered the DataFrame to include only the rows where the age is greater than 25.
You can also use variable substitution in the query
method like this:
age_threshold = 25 result = df.query('age > @age_threshold') print(result)
Output:
user_id age plan_type 1 2 30 Premium 3 4 26 Premium
Notice the use of the “@” symbol for variable substitution, allowing you to pass in variables dynamically.
Using filter Method
The filter()
method cannot be used to filter rows.
Column Selection
As we mentioned before, the query
method is not mainly for selecting columns, but you can select specific columns along with row filtering.
Using query()
You can select specific columns while filtering rows by applying the query first and then selecting the desired columns:
data = {'user_id': [1, 2, 3, 4], 'age': [24, 30, 22, 26], 'plan_type': ['Basic', 'Premium', 'Basic', 'Premium']} df = pd.DataFrame(data) # Using query to filter rows and select specific columns query_columns_result = df.query('age > 25')[['user_id', 'age']] print(query_columns_result)
Output:
user_id age 1 2 30 3 4 26
You filtered rows where the age is greater than 25 and then selected only the user_id
and age
columns.
Using filter()
The filter
method generally acts on the DataFrame’s columns as a whole and doesn’t provide the flexibility to filter rows at the same time:
filter_columns_result = df.filter(items=['user_id', 'age']) print(filter_columns_result)
Output:
user_id age 0 1 24 1 2 30 2 3 22 3 4 26
Here, you’ve selected only the user_id
and age
columns, but all rows are included, unlike with query
.
You can also use regular expressions to filter columns:
regex_filtered_df = df.filter(regex='id|type') print(regex_filtered_df)
Output:
user_id plan_type 0 1 Basic 1 2 Premium 2 3 Basic 3 4 Premium
In this example, columns that match the regular expression ‘id|type’ are selected.
When to Use query method?
- Row Filtering: If you have a complex condition based on which you want to filter rows,
query
is your choice. - Column and Row Operations: When you need to perform both column selection and row filtering,
query
followed by standard Python indexing is often more convenient. - In-Place Operations: If you want to modify the DataFrame in place while filtering rows,
query
with theinplace=True
parameter is your choice.
When to Use filter method?
- Column Filtering: When you only need to select specific columns,
filter
is a straightforward option. - Substring Matching: To select columns based on partial name matches,
filter
with thelike
orregex
parameters is your choice.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.