Pandas Query Method: Filter DataFrames Efficiently

Pandas offers the query()method that allows users to query DataFrames using string expressions.

In this tutorial, you’ll learn about the capabilities of the query method and how to use it in real-world applications.

 

 

Pandas query Syntax and Parameters

The syntax for the query() method is as follows:

DataFrame.query(expr, inplace=False, **kwargs)
  • expr: A string expression, which is similar to the expressions you’d use in SQL. It’s the condition or set of conditions you’d like to apply to the DataFrame.
  • inplace: A boolean (default is False). If True, it modifies the DataFrame in place and returns None. Otherwise, it returns a new DataFrame that’s been filtered.
  • kwargs: Other keyword arguments that can be passed, especially useful for more complex queries. You can often use these to refer to external variables, which are sometimes easier to plug into your query strings.

For instance, if you want to query a DataFrame based on values in a column named ‘age’, you could use an expression like age > 25.

 

Filter a DataFrame Using query Method

To understand the power and simplicity of the query() method, let’s start by creating a DataFrame using Pandas.

import pandas as pd
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [24, 28, 22, 35, 30],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}
df = pd.DataFrame(data)
print(df)

Output:

      name  age         city
0    Alice   24     New York
1      Bob   28  Los Angeles
2  Charlie   22      Chicago
3    David   35      Houston
4      Eve   30      Phoenix

Now, let’s use the query() method to filter out rows where the age is greater than 25.

result = df.query('age > 25')
print(result)

Output:

    name  age         city
1    Bob   28  Los Angeles
3  David   35      Houston
4    Eve   30      Phoenix

From the output, you can observe that only the rows where the age exceeds 25 are displayed.

 

Using Comparison Operators

You can use the comparison operator with the query() method to filter your data.

Equal To (==)

To filter rows based on exact matches, use the == operator.

result_equal = df.query('city == "New York"')
print(result_equal)

Output:

    name  age      city
0  Alice   24  New York

In this case, only the row where the city is “New York” is displayed.

Not Equal To (!=)

To filter out rows based on non-matching criteria, use the != operator.

result_not_equal = df.query('city != "New York"')
print(result_not_equal)

Output:

      name  age         city
1      Bob   28  Los Angeles
2  Charlie   22      Chicago
3    David   35      Houston
4      Eve   30      Phoenix

Here, all rows except the one where the city is “New York” are shown.

Greater Than (>) and Less Than (<)

We’ve already explored the > operator at the beginning of the tutorial. Let’s now retrieve rows where the age is less than 30.

result_less = df.query('age < 30')
print(result_less)

Output:

      name  age         city
0    Alice   24     New York
1      Bob   28  Los Angeles
2  Charlie   22      Chicago

This output showcases individuals aged less than 30.

 

Using Multiple Conditions

You can use multiple conditions with the query() method to narrow down your results or to broaden the scope of your data filtering.

And (&)

To meet multiple conditions simultaneously, use the & operator.

result_and = df.query('(age > 25) & (city == "Los Angeles")')
print(result_and)

Output:

  name  age         city
1  Bob   28  Los Angeles

Here, only the rows meeting both criteria — age greater than 25 and city being “Los Angeles” — are displayed.

Remember, when combining conditions with the & operator, it’s crucial to enclose individual conditions within parentheses. This ensures that they are evaluated correctly.

Or (|)

You can use the | operator to retrieve rows that satisfy at least one of several conditions:

result_or = df.query('(age < 25) | (city == "Houston")')
print(result_or)

Output:

      name  age      city
0    Alice   24  New York
2  Charlie   22   Chicago
3    David   35   Houston

This query fetches rows where the individual is either aged below 25 or lives in Houston.

Not (~)

To invert the result of a condition, use the ~ operator.

result_not = df.query('~(city == "New York")')
print(result_not)

Output:

      name  age         city
1      Bob   28  Los Angeles
2  Charlie   22      Chicago
3    David   35      Houston
4      Eve   30      Phoenix

Here, you get rows where the city is not “New York”.

 

In-place Modification

The inplace parameter determines whether the DataFrame is modified in place or if a new DataFrame is returned.

By default, inplace is set to False, which means the original DataFrame remains unchanged and a new filtered DataFrame is returned.

# Using the previously defined 'df'
result = df.query('age > 25')
print("Filtered DataFrame:\n", result)
print("\nOriginal DataFrame:\n", df)

Output:

Filtered DataFrame:
     name  age         city
1    Bob   28  Los Angeles
3  David   35      Houston
4    Eve   30      Phoenix

Original DataFrame:
       name  age         city
0    Alice   24     New York
1      Bob   28  Los Angeles
2  Charlie   22      Chicago
3    David   35      Houston
4      Eve   30      Phoenix

As you can see, the original DataFrame remains unchanged.

Modifying In-place (inplace=True)

If you set the inplace parameter to True, the original DataFrame will be directly modified.

This is memory efficient, especially with large DataFrames, as no new object is created.

df.query('age > 25', inplace=True)
print(df)

Output:

    name  age         city
1    Bob   28  Los Angeles
3  David   35      Houston
4    Eve   30      Phoenix

Here, the DataFrame ‘df’ has been directly modified to reflect only those rows where the age is greater than 25.

Always proceed with caution when using inplace=True as it makes irreversible changes to your original data.

It’s good practice to ensure you have backups or are certain about the modifications.

 

Using String Methods

The query() method offers some string methods for text-based querying.

Checking the Presence of Substring with .str.contains()

You can use the .str.contains() method to filter rows based on the presence of a particular substring within a string column:

result_contains = df.query('city.str.contains("New")')
print(result_contains)

Output:

    name  age      city
0  Alice   24  New York

In the example above, the query retrieves rows where the ‘city’ column contains the substring “New”.

Case-Insensitive Queries

You can use the case parameter to make queries case insensitive:

result_case_insensitive = df.query('city.str.contains("new", case=False)')
print(result_case_insensitive)

Output:

    name  age      city
0  Alice   24  New York

Even though “new” is in lowercase, the query successfully matches “New York”.

Matching Start or End of Strings

These string methods allow you to match substrings at the beginning or end of your data respectively.

result_start = df.query('name.str.startswith("A")')
print(result_start)
result_end = df.query('name.str.endswith("e")')
print(result_end)

Output:

    name  age      city
0  Alice   24  New York
---------
      name  age      city
0    Alice   24  New York
2  Charlie   22   Chicago
4      Eve   30   Phoenix

Here, the first query fetches rows where the name starts with the letter “A” while the second query fetches rows where the name ends with the letter “e”.

Read more about using string methods in Pandas query().

 

Querying with Index

You can use the index keyword in the query() method if your DataFrame has a default integer index.

result_index = df.query('index > 2')
print(result_index)

Output:

    name  age      city
3  David   35   Houston
4    Eve   30   Phoenix

The output showcases rows where the index value exceeds 2.

Querying with Custom Indexes

If the index is not an integer, you can use the same index keyword to perform queries.

df_custom_index = df.set_index('name')
result_custom_index = df_custom_index.query('index == "Alice"')
print(result_custom_index)

Output:

       age      city
name                
Alice   24  New York

The query then retrieves the row with the index value “Alice”.

Combining Column and Index Queries

You can also combine queries on the index and columns to refine your data extraction process further.

result_combined = df_custom_index.query('(index == "Alice") or (age > 30)')
print(result_combined)

Output:

       age      city
name                
Alice   24  New York
David   35   Houston

The resulting DataFrame displays rows where either the name is “Alice” or the age is greater than 30.

 

Query method with Multi-index DataFrames

Let’s start by creating a sample multi-index DataFrame to demonstrate the query functionality.

arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('letters', 'numbers'))
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [24, 28, 22, 35]
}
df_multi = pd.DataFrame(data, index=index)
print(df_multi)

Output:

               name  age
letters numbers        
A       1      Alice   24
        2        Bob   28
B       1    Charlie   22
        2      David   35

Querying on the First Level of Index

To perform a query on the first level of the index:

result_first_level = df_multi.query('letters == "A"')
print(result_first_level)

Output:

               name  age
letters numbers        
A       1      Alice   24
        2        Bob   28

This query filters the rows based on the first level of the index, which matches the letter “A”.

Querying on the Second Level of Index

Similarly, querying on the second level:

result_second_level = df_multi.query('numbers == 2')
print(result_second_level)

Output:

               name  age
letters numbers        
A       2        Bob   28
B       2      David   35

The output displays rows where the second index level has a value of 2.

Combining Queries Across Index Levels

You can also combine conditions across multiple index levels:

combined_result = df_multi.query('(letters == "A") & (numbers == 2)')
print(combined_result)

Output:

               name  age
letters numbers        
A       2        Bob   28

The query ensures that both conditions — first level index matching “A” and second level index matching 2 — are satisfied.

 

Using ‘in’ and ‘not in’ operators

The ‘in’ and ‘not in’ operators in the query() method allows you to filter DataFrame rows based on whether a column’s values belong to (or don’t belong to) a specific list of values

Using the ‘in’ Operator

The ‘in’ operator checks if the value of a column is among a list of specified values.

cities_to_filter = ["New York", "Houston"]
result_in = df.query('city in @cities_to_filter')
print(result_in)

Output:

    name  age      city
0  Alice   24  New York
3  David   35   Houston

Here, the ‘@’ symbol is used before the variable name cities_to_filter to indicate that it’s an external variable, and not a column name. We’ll explain external variables later in this tutorial.

The query fetches rows where the ‘city’ column’s value is either “New York” or “Houston”.

Using the ‘not in’ Operator

Conversely, the ‘not in’ operator checks if the value of a column is not among a list of specified values.

result_not_in = df.query('city not in @cities_to_filter')
print(result_not_in)

Output:

      name  age         city
1      Bob   28  Los Angeles
2  Charlie   22      Chicago
4      Eve   30      Phoenix

This time, the output consists of rows where the ‘city’ column’s value is anything but “New York” or “Houston”.

 

Querying for null or not null values

You can use isnull()to filter rows where a particular column has null values:

import numpy as np
new_row = pd.DataFrame({
    'name': ['Frank'],
    'age': [np.NaN],
    'city': ['Miami']
})
df_with_nulls = pd.concat([df, new_row], ignore_index=True)
result_null = df_with_nulls.query('age.isnull()')
print(result_null)

Output:

    name  age   city
5  Frank  NaN  Miami

As shown, the query retrieves rows where the ‘age’ column contains a NaN value.

Querying for Not Null Values

You can use notnull() to retrieve rows where a column has non-null values:

result_not_null = df_with_nulls.query('age.notnull()')
print(result_not_null)

Output:

      name   age         city
0    Alice  24.0     New York
1      Bob  28.0  Los Angeles
2  Charlie  22.0      Chicago
3    David  35.0      Houston
4      Eve  30.0      Phoenix

In this case, the output includes all rows except the ones where the ‘age’ column contains a NaN value.

 

Parameterizing Queries by Passing Variables

The query() method in Pandas allows you to integrate external variables into your query strings to make your queries more dynamic.

To reference an external variable inside your query() string, prefix it with the ‘@’ symbol.

name_to_filter = "Alice"
result = df.query('name == @name_to_filter')
print(result)

Output:

    name  age      city
0  Alice   24  New York

Here, the variable name_to_filter is used to filter rows where the ‘name’ column matches the value “Alice”.

Using Lists in Parameterized Queries

If you have a list of values and you’d like to filter based on them using the ‘in’ operator:

names_to_filter = ["Alice", "Bob"]
result_list = df.query('name in @names_to_filter')
print(result_list)

Output:

    name  age         city
0  Alice   24     New York
1    Bob   28  Los Angeles

Combining Multiple Variables in a Query

You can use multiple external variables in a single query:

name_to_filter = "Alice"
age_threshold = 25
result_combined = df.query('(name == @name_to_filter) & (age < @age_threshold)')
print(result_combined)

Output:

    name  age      city
0  Alice   24  New York

This query filters the DataFrame for rows where the ‘name’ is “Alice” and the ‘age’ is less than 25.

 

Chaining Queries

Suppose you have a dataset of people and you first want to filter those from “New York” and then, from that subset, select those over the age of 20. You can write two queries like this:

first_query = df.query('city == "New York"')
result = first_query.query('age > 20')
print(result)

For more readability, you can chain the queries together:

result = df.query('city == "New York"').query('age > 20')
print(result)

Output:

    name  age      city
0  Alice   24  New York

Using Intermediate Variables

You can use intermediate variables to make the code clearer. This approach improves readability and debugging:

new_york_residents = df.query('city == "New York"')
older_than_20 = new_york_residents.query('age > 20')
print(older_than_20)

Output:

    name  age      city
0  Alice   24  New York

Nested queries can lead to increased memory usage if not managed properly. Remember to release or del intermediate DataFrames if they’re no longer needed.

 

Querying Datetime Indexes

For demonstration, let’s create a sample DataFrame with a datetime index:

import pandas as pd
date_rng = pd.date_range(start='2022-01-01', end='2022-12-31', freq='M')
df_datetime = pd.DataFrame(date_rng, columns=['date'])
df_datetime['data'] = range(1, 13)  # Adding a column with range of numbers
df_datetime.set_index('date', inplace=True)
print(df_datetime)

Output:

            data
date            
2022-01-31     1
2022-02-28     2
...            ...
2022-12-31    12

Now, the DataFrame has a datetime index representing the end of each month in 2022.

To filter based on the datetime index, you can use the query() method like this:

result = df_datetime.query('"2022-06-01" <= index <= "2022-12-31"')
print(result)

Output:

            data
date            
2022-06-30     6
...            ...
2022-12-31    12

Here, the DataFrame is filtered to only include rows from June 2022 onwards.

Using Parameterized DateTime Queries

Combining the earlier discussion about parameterized queries with datetime indexes, you can make your datetime queries more dynamic:

start_date = "2022-03-01"
end_date = "2022-09-30"
result_parameterized = df_datetime.query('@start_date <= index <= @end_date')
print(result_parameterized)

Output:

            data
date            
2022-03-31     3
...            ...
2022-09-30     9

Querying on DateTime Attributes

You can filter data based on specific time attributes using datetime indexes like month, year, or weekday:

# Get data for all Mondays (weekday = 0)
mondays = df_datetime[df_datetime.index.weekday == 0]
print(mondays)

Output:

            data
date            
2022-01-31     1
2022-02-28     2
2022-10-31    10

This query retrieves rows where the index’s weekday is Monday.

 

Using Regex

Let’s create a sample DataFrame at the beginging:

import pandas as pd
data = {
    'product': ['apple', 'banana', 'cherry', 'apple pie', 'banana split', 'cherry pie', 'apple tart'],
    'price': [0.5, 0.3, 0.7, 2.5, 3.5, 2.8, 2.0]
}
df = pd.DataFrame(data)
print(df)

Output:

         product  price
0         apple    0.5
1        banana    0.3
...           ...
6     apple tart  2.0

You can leverage the power of regex patterns within your queries:

# Products ending with 'pie'
pie_products = df.query('product.str.contains("pie$")')
print(pie_products)

Output:

      product  price
3  apple pie    2.5
5  cherry pie  2.8

Read more about Using Regex in Pandas Query.

 

Troubleshooting Common Errors

The query() method in Pandas can produce errors when used incorrectly. Let’s walk through some common pitfalls and understand how to troubleshoot and rectify them.

Referencing Nonexistent Columns

One of the most frequent issues is referencing a column that doesn’t exist in the DataFrame.

Example:

import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})

# Trying to filter by a non-existent column 'height'
result = df.query('height > 170')

Error:

KeyError: 'height'

Solution: Always ensure that the columns you reference in your query are present in the DataFrame. Use the columns attribute of the DataFrame to check:

print(df.columns)

Using Invalid Operators or Syntax

The query() method is reliant on a correct syntax. If you use an operator that’s not supported or make a syntax error, you’ll encounter an error.

Example:

# Using a non-supported operator such as '==='
result = df.query('age === 25')

Error:

SyntaxError: ... syntax error ...

Solution: Ensure that you are using the correct operators and syntax is supported by the query() method. In this case, the correct operator is ==.

Handling Ambiguous Syntax

Certain column names might interfere with Python’s reserved keywords or cause ambiguity.

Example:

df_ambiguous = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'in': [1, 2, 3]
})

result = df_ambiguous.query('in == 2')

Error:

SyntaxError: Python keyword not valid identifier in numexpr query

Solution: For such column names, wrap them in backticks:

result = df_ambiguous.query('`in` == 2')
print(result)

Output:

  name  in
1  Bob   2

 

Resource

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Leave a Reply

Your email address will not be published. Required fields are marked *