Pandas Query Method: Filter DataFrames Efficiently
Pandas offers the query()
method that allows users to query DataFrames using string expressions.
In this tutorial, you’ll learn about the capabilities of the query
method and how to use it in real-world applications.
- 1 Pandas query Syntax and Parameters
- 2 Filter a DataFrame Using query Method
- 3 Using Comparison Operators
- 4 Using Multiple Conditions
- 5 In-place Modification
- 6 Using String Methods
- 7 Querying with Index
- 8 Query method with Multi-index DataFrames
- 9 Using ‘in’ and ‘not in’ operators
- 10 Querying for null or not null values
- 11 Parameterizing Queries by Passing Variables
- 12 Chaining Queries
- 13 Querying Datetime Indexes
- 14 Using Regex
- 15 Troubleshooting Common Errors
- 16 Resource
Pandas query Syntax and Parameters
The syntax for the query()
method is as follows:
DataFrame.query(expr, inplace=False, **kwargs)
- expr: A string expression, which is similar to the expressions you’d use in SQL. It’s the condition or set of conditions you’d like to apply to the DataFrame.
- inplace: A boolean (default is False). If True, it modifies the DataFrame in place and returns None. Otherwise, it returns a new DataFrame that’s been filtered.
- kwargs: Other keyword arguments that can be passed, especially useful for more complex queries. You can often use these to refer to external variables, which are sometimes easier to plug into your query strings.
For instance, if you want to query a DataFrame based on values in a column named ‘age’, you could use an expression like age > 25
.
Filter a DataFrame Using query Method
To understand the power and simplicity of the query()
method, let’s start by creating a DataFrame using Pandas.
import pandas as pd data = { 'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'], 'age': [24, 28, 22, 35, 30], 'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'] } df = pd.DataFrame(data) print(df)
Output:
name age city 0 Alice 24 New York 1 Bob 28 Los Angeles 2 Charlie 22 Chicago 3 David 35 Houston 4 Eve 30 Phoenix
Now, let’s use the query()
method to filter out rows where the age is greater than 25.
result = df.query('age > 25') print(result)
Output:
name age city 1 Bob 28 Los Angeles 3 David 35 Houston 4 Eve 30 Phoenix
From the output, you can observe that only the rows where the age exceeds 25 are displayed.
Using Comparison Operators
You can use the comparison operator with the query()
method to filter your data.
Equal To (==)
To filter rows based on exact matches, use the ==
operator.
result_equal = df.query('city == "New York"') print(result_equal)
Output:
name age city 0 Alice 24 New York
In this case, only the row where the city is “New York” is displayed.
Not Equal To (!=)
To filter out rows based on non-matching criteria, use the !=
operator.
result_not_equal = df.query('city != "New York"') print(result_not_equal)
Output:
name age city 1 Bob 28 Los Angeles 2 Charlie 22 Chicago 3 David 35 Houston 4 Eve 30 Phoenix
Here, all rows except the one where the city is “New York” are shown.
Greater Than (>) and Less Than (<)
We’ve already explored the >
operator at the beginning of the tutorial. Let’s now retrieve rows where the age is less than 30.
result_less = df.query('age < 30') print(result_less)
Output:
name age city 0 Alice 24 New York 1 Bob 28 Los Angeles 2 Charlie 22 Chicago
This output showcases individuals aged less than 30.
Using Multiple Conditions
You can use multiple conditions with the query()
method to narrow down your results or to broaden the scope of your data filtering.
And (&)
To meet multiple conditions simultaneously, use the &
operator.
result_and = df.query('(age > 25) & (city == "Los Angeles")') print(result_and)
Output:
name age city 1 Bob 28 Los Angeles
Here, only the rows meeting both criteria — age greater than 25 and city being “Los Angeles” — are displayed.
Remember, when combining conditions with the &
operator, it’s crucial to enclose individual conditions within parentheses. This ensures that they are evaluated correctly.
Or (|)
You can use the |
operator to retrieve rows that satisfy at least one of several conditions:
result_or = df.query('(age < 25) | (city == "Houston")') print(result_or)
Output:
name age city 0 Alice 24 New York 2 Charlie 22 Chicago 3 David 35 Houston
This query fetches rows where the individual is either aged below 25 or lives in Houston.
Not (~)
To invert the result of a condition, use the ~
operator.
result_not = df.query('~(city == "New York")') print(result_not)
Output:
name age city 1 Bob 28 Los Angeles 2 Charlie 22 Chicago 3 David 35 Houston 4 Eve 30 Phoenix
Here, you get rows where the city is not “New York”.
In-place Modification
The inplace
parameter determines whether the DataFrame is modified in place or if a new DataFrame is returned.
By default, inplace
is set to False, which means the original DataFrame remains unchanged and a new filtered DataFrame is returned.
# Using the previously defined 'df' result = df.query('age > 25') print("Filtered DataFrame:\n", result) print("\nOriginal DataFrame:\n", df)
Output:
Filtered DataFrame: name age city 1 Bob 28 Los Angeles 3 David 35 Houston 4 Eve 30 Phoenix Original DataFrame: name age city 0 Alice 24 New York 1 Bob 28 Los Angeles 2 Charlie 22 Chicago 3 David 35 Houston 4 Eve 30 Phoenix
As you can see, the original DataFrame remains unchanged.
Modifying In-place (inplace=True)
If you set the inplace
parameter to True, the original DataFrame will be directly modified.
This is memory efficient, especially with large DataFrames, as no new object is created.
df.query('age > 25', inplace=True) print(df)
Output:
name age city 1 Bob 28 Los Angeles 3 David 35 Houston 4 Eve 30 Phoenix
Here, the DataFrame ‘df’ has been directly modified to reflect only those rows where the age is greater than 25.
Always proceed with caution when using inplace=True
as it makes irreversible changes to your original data.
It’s good practice to ensure you have backups or are certain about the modifications.
Using String Methods
The query()
method offers some string methods for text-based querying.
Checking the Presence of Substring with .str.contains()
You can use the .str.contains()
method to filter rows based on the presence of a particular substring within a string column:
result_contains = df.query('city.str.contains("New")') print(result_contains)
Output:
name age city 0 Alice 24 New York
In the example above, the query retrieves rows where the ‘city’ column contains the substring “New”.
Case-Insensitive Queries
You can use the case
parameter to make queries case insensitive:
result_case_insensitive = df.query('city.str.contains("new", case=False)') print(result_case_insensitive)
Output:
name age city 0 Alice 24 New York
Even though “new” is in lowercase, the query successfully matches “New York”.
Matching Start or End of Strings
These string methods allow you to match substrings at the beginning or end of your data respectively.
result_start = df.query('name.str.startswith("A")') print(result_start) result_end = df.query('name.str.endswith("e")') print(result_end)
Output:
name age city 0 Alice 24 New York --------- name age city 0 Alice 24 New York 2 Charlie 22 Chicago 4 Eve 30 Phoenix
Here, the first query fetches rows where the name starts with the letter “A” while the second query fetches rows where the name ends with the letter “e”.
Read more about using string methods in Pandas query().
Querying with Index
You can use the index
keyword in the query()
method if your DataFrame has a default integer index.
result_index = df.query('index > 2') print(result_index)
Output:
name age city 3 David 35 Houston 4 Eve 30 Phoenix
The output showcases rows where the index value exceeds 2.
Querying with Custom Indexes
If the index is not an integer, you can use the same index
keyword to perform queries.
df_custom_index = df.set_index('name') result_custom_index = df_custom_index.query('index == "Alice"') print(result_custom_index)
Output:
age city name Alice 24 New York
The query then retrieves the row with the index value “Alice”.
Combining Column and Index Queries
You can also combine queries on the index and columns to refine your data extraction process further.
result_combined = df_custom_index.query('(index == "Alice") or (age > 30)') print(result_combined)
Output:
age city name Alice 24 New York David 35 Houston
The resulting DataFrame displays rows where either the name is “Alice” or the age is greater than 30.
Query method with Multi-index DataFrames
Let’s start by creating a sample multi-index DataFrame to demonstrate the query functionality.
arrays = [ ['A', 'A', 'B', 'B'], [1, 2, 1, 2] ] index = pd.MultiIndex.from_arrays(arrays, names=('letters', 'numbers')) data = { 'name': ['Alice', 'Bob', 'Charlie', 'David'], 'age': [24, 28, 22, 35] } df_multi = pd.DataFrame(data, index=index) print(df_multi)
Output:
name age letters numbers A 1 Alice 24 2 Bob 28 B 1 Charlie 22 2 David 35
Querying on the First Level of Index
To perform a query on the first level of the index:
result_first_level = df_multi.query('letters == "A"') print(result_first_level)
Output:
name age letters numbers A 1 Alice 24 2 Bob 28
This query filters the rows based on the first level of the index, which matches the letter “A”.
Querying on the Second Level of Index
Similarly, querying on the second level:
result_second_level = df_multi.query('numbers == 2') print(result_second_level)
Output:
name age letters numbers A 2 Bob 28 B 2 David 35
The output displays rows where the second index level has a value of 2.
Combining Queries Across Index Levels
You can also combine conditions across multiple index levels:
combined_result = df_multi.query('(letters == "A") & (numbers == 2)') print(combined_result)
Output:
name age letters numbers A 2 Bob 28
The query ensures that both conditions — first level index matching “A” and second level index matching 2 — are satisfied.
Using ‘in’ and ‘not in’ operators
The ‘in’ and ‘not in’ operators in the query()
method allows you to filter DataFrame rows based on whether a column’s values belong to (or don’t belong to) a specific list of values
Using the ‘in’ Operator
The ‘in’ operator checks if the value of a column is among a list of specified values.
cities_to_filter = ["New York", "Houston"] result_in = df.query('city in @cities_to_filter') print(result_in)
Output:
name age city 0 Alice 24 New York 3 David 35 Houston
Here, the ‘@’ symbol is used before the variable name cities_to_filter
to indicate that it’s an external variable, and not a column name. We’ll explain external variables later in this tutorial.
The query fetches rows where the ‘city’ column’s value is either “New York” or “Houston”.
Using the ‘not in’ Operator
Conversely, the ‘not in’ operator checks if the value of a column is not among a list of specified values.
result_not_in = df.query('city not in @cities_to_filter') print(result_not_in)
Output:
name age city 1 Bob 28 Los Angeles 2 Charlie 22 Chicago 4 Eve 30 Phoenix
This time, the output consists of rows where the ‘city’ column’s value is anything but “New York” or “Houston”.
Querying for null or not null values
You can use isnull()
to filter rows where a particular column has null values:
import numpy as np new_row = pd.DataFrame({ 'name': ['Frank'], 'age': [np.NaN], 'city': ['Miami'] }) df_with_nulls = pd.concat([df, new_row], ignore_index=True) result_null = df_with_nulls.query('age.isnull()') print(result_null)
Output:
name age city 5 Frank NaN Miami
As shown, the query retrieves rows where the ‘age’ column contains a NaN
value.
Querying for Not Null Values
You can use notnull()
to retrieve rows where a column has non-null values:
result_not_null = df_with_nulls.query('age.notnull()') print(result_not_null)
Output:
name age city 0 Alice 24.0 New York 1 Bob 28.0 Los Angeles 2 Charlie 22.0 Chicago 3 David 35.0 Houston 4 Eve 30.0 Phoenix
In this case, the output includes all rows except the ones where the ‘age’ column contains a NaN
value.
Parameterizing Queries by Passing Variables
The query()
method in Pandas allows you to integrate external variables into your query strings to make your queries more dynamic.
To reference an external variable inside your query()
string, prefix it with the ‘@’ symbol.
name_to_filter = "Alice" result = df.query('name == @name_to_filter') print(result)
Output:
name age city 0 Alice 24 New York
Here, the variable name_to_filter
is used to filter rows where the ‘name’ column matches the value “Alice”.
Using Lists in Parameterized Queries
If you have a list of values and you’d like to filter based on them using the ‘in’ operator:
names_to_filter = ["Alice", "Bob"] result_list = df.query('name in @names_to_filter') print(result_list)
Output:
name age city 0 Alice 24 New York 1 Bob 28 Los Angeles
Combining Multiple Variables in a Query
You can use multiple external variables in a single query:
name_to_filter = "Alice" age_threshold = 25 result_combined = df.query('(name == @name_to_filter) & (age < @age_threshold)') print(result_combined)
Output:
name age city 0 Alice 24 New York
This query filters the DataFrame for rows where the ‘name’ is “Alice” and the ‘age’ is less than 25.
Chaining Queries
Suppose you have a dataset of people and you first want to filter those from “New York” and then, from that subset, select those over the age of 20. You can write two queries like this:
first_query = df.query('city == "New York"') result = first_query.query('age > 20') print(result)
For more readability, you can chain the queries together:
result = df.query('city == "New York"').query('age > 20') print(result)
Output:
name age city 0 Alice 24 New York
Using Intermediate Variables
You can use intermediate variables to make the code clearer. This approach improves readability and debugging:
new_york_residents = df.query('city == "New York"') older_than_20 = new_york_residents.query('age > 20') print(older_than_20)
Output:
name age city 0 Alice 24 New York
Nested queries can lead to increased memory usage if not managed properly. Remember to release or del intermediate DataFrames if they’re no longer needed.
Querying Datetime Indexes
For demonstration, let’s create a sample DataFrame with a datetime index:
import pandas as pd date_rng = pd.date_range(start='2022-01-01', end='2022-12-31', freq='M') df_datetime = pd.DataFrame(date_rng, columns=['date']) df_datetime['data'] = range(1, 13) # Adding a column with range of numbers df_datetime.set_index('date', inplace=True) print(df_datetime)
Output:
data date 2022-01-31 1 2022-02-28 2 ... ... 2022-12-31 12
Now, the DataFrame has a datetime index representing the end of each month in 2022.
To filter based on the datetime index, you can use the query()
method like this:
result = df_datetime.query('"2022-06-01" <= index <= "2022-12-31"') print(result)
Output:
data date 2022-06-30 6 ... ... 2022-12-31 12
Here, the DataFrame is filtered to only include rows from June 2022 onwards.
Using Parameterized DateTime Queries
Combining the earlier discussion about parameterized queries with datetime indexes, you can make your datetime queries more dynamic:
start_date = "2022-03-01" end_date = "2022-09-30" result_parameterized = df_datetime.query('@start_date <= index <= @end_date') print(result_parameterized)
Output:
data date 2022-03-31 3 ... ... 2022-09-30 9
Querying on DateTime Attributes
You can filter data based on specific time attributes using datetime indexes like month, year, or weekday:
# Get data for all Mondays (weekday = 0) mondays = df_datetime[df_datetime.index.weekday == 0] print(mondays)
Output:
data date 2022-01-31 1 2022-02-28 2 2022-10-31 10
This query retrieves rows where the index’s weekday is Monday.
Using Regex
Let’s create a sample DataFrame at the beginging:
import pandas as pd data = { 'product': ['apple', 'banana', 'cherry', 'apple pie', 'banana split', 'cherry pie', 'apple tart'], 'price': [0.5, 0.3, 0.7, 2.5, 3.5, 2.8, 2.0] } df = pd.DataFrame(data) print(df)
Output:
product price 0 apple 0.5 1 banana 0.3 ... ... 6 apple tart 2.0
You can leverage the power of regex patterns within your queries:
# Products ending with 'pie' pie_products = df.query('product.str.contains("pie$")') print(pie_products)
Output:
product price 3 apple pie 2.5 5 cherry pie 2.8
Read more about Using Regex in Pandas Query.
Troubleshooting Common Errors
The query()
method in Pandas can produce errors when used incorrectly. Let’s walk through some common pitfalls and understand how to troubleshoot and rectify them.
Referencing Nonexistent Columns
One of the most frequent issues is referencing a column that doesn’t exist in the DataFrame.
Example:
import pandas as pd df = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35] }) # Trying to filter by a non-existent column 'height' result = df.query('height > 170')
Error:
KeyError: 'height'
Solution: Always ensure that the columns you reference in your query are present in the DataFrame. Use the columns
attribute of the DataFrame to check:
print(df.columns)
Using Invalid Operators or Syntax
The query()
method is reliant on a correct syntax. If you use an operator that’s not supported or make a syntax error, you’ll encounter an error.
Example:
# Using a non-supported operator such as '===' result = df.query('age === 25')
Error:
SyntaxError: ... syntax error ...
Solution: Ensure that you are using the correct operators and syntax is supported by the query()
method. In this case, the correct operator is ==
.
Handling Ambiguous Syntax
Certain column names might interfere with Python’s reserved keywords or cause ambiguity.
Example:
df_ambiguous = pd.DataFrame({ 'name': ['Alice', 'Bob', 'Charlie'], 'in': [1, 2, 3] }) result = df_ambiguous.query('in == 2')
Error:
SyntaxError: Python keyword not valid identifier in numexpr query
Solution: For such column names, wrap them in backticks:
result = df_ambiguous.query('`in` == 2') print(result)
Output:
name in 1 Bob 2
Resource
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.