Pandas query() vs filter(): Which Method You Should Use?

Two common methods that you’ll often use in Pandas are query() and filter().

The main difference: The query() method is used mainly to filter rows using string expressions while filter method is used mainly for column selection.

In this tutorial, you’ll understand the differences between them and when to use which.

 

 

Filter Rows

Using query Method

Here’s a basic example:

import pandas as pd
data = {'user_id': [1, 2, 3, 4],
        'age': [24, 30, 22, 26],
        'plan_type': ['Basic', 'Premium', 'Basic', 'Premium']}
df = pd.DataFrame(data)
result = df.query('age > 25')
print(result)

Output:

   user_id  age plan_type
1        2   30   Premium
3        4   26   Premium

Here, you filtered the DataFrame to include only the rows where the age is greater than 25.

You can also use variable substitution in the query method like this:

age_threshold = 25
result = df.query('age > @age_threshold')
print(result)

Output:

   user_id  age plan_type
1        2   30   Premium
3        4   26   Premium

Notice the use of the “@” symbol for variable substitution, allowing you to pass in variables dynamically.

Using filter Method

The filter() method cannot be used to filter rows.

 

Column Selection

As we mentioned before, the query method is not mainly for selecting columns, but you can select specific columns along with row filtering.

Using query()

You can select specific columns while filtering rows by applying the query first and then selecting the desired columns:

data = {'user_id': [1, 2, 3, 4],
        'age': [24, 30, 22, 26],
        'plan_type': ['Basic', 'Premium', 'Basic', 'Premium']}
df = pd.DataFrame(data)

# Using query to filter rows and select specific columns
query_columns_result = df.query('age > 25')[['user_id', 'age']]
print(query_columns_result)

Output:

   user_id  age
1        2   30
3        4   26

You filtered rows where the age is greater than 25 and then selected only the user_id and age columns.

Using filter()

The filter method generally acts on the DataFrame’s columns as a whole and doesn’t provide the flexibility to filter rows at the same time:

filter_columns_result = df.filter(items=['user_id', 'age'])
print(filter_columns_result)

Output:

   user_id  age
0        1   24
1        2   30
2        3   22
3        4   26

Here, you’ve selected only the user_id and age columns, but all rows are included, unlike with query.

You can also use regular expressions to filter columns:

regex_filtered_df = df.filter(regex='id|type')
print(regex_filtered_df)

Output:

   user_id plan_type
0        1     Basic
1        2   Premium
2        3     Basic
3        4   Premium

In this example, columns that match the regular expression ‘id|type’ are selected.

 

When to Use query method?

  1. Row Filtering: If you have a complex condition based on which you want to filter rows, query is your choice.
  2. Column and Row Operations: When you need to perform both column selection and row filtering, query followed by standard Python indexing is often more convenient.
  3. In-Place Operations: If you want to modify the DataFrame in place while filtering rows, query with the inplace=True parameter is your choice.

 

When to Use filter method?

  1. Column Filtering: When you only need to select specific columns, filter is a straightforward option.
  2. Substring Matching: To select columns based on partial name matches, filter with the like or regex parameters is your choice.
Leave a Reply

Your email address will not be published. Required fields are marked *