Filter Pandas DataFrame using query() with data in range
The query
method in Pandas allows you to perform queries on your DataFrame.
In this tutorial, you’ll learn how to filter DataFrame numeric columns by integer ranges, date ranges, or floating-point value ranges and combine logical operators like and/or/not to construct complex query expressions.
Querying Numeric Data Ranges
In this section, we’ll learn how to query numeric data ranges using integers, floating-point numbers.
Querying Integers Within a Range
First, let’s load some sample data into a DataFrame.
import pandas as pd data = { 'user_id': [101, 102, 103, 104, 105], 'monthly_usage_gb': [20, 50, 30, 40, 25], 'connections': [5, 10, 3, 7, 4] } df = pd.DataFrame(data)
To filter this DataFrame based on a range of monthly usage values, you can use the query
method like this:
result = df.query('20 <= monthly_usage_gb <= 40') print(result)
Output
user_id monthly_usage_gb connections 0 101 20 5 2 103 30 3 3 104 40 7 4 105 25 4
Querying Floating-Point Numbers Within a Range
Suppose we have another column in our DataFrame that stores floating-point numbers, like customer ratings.
# Adding a 'rating' column with floating-point numbers df['rating'] = [4.5, 3.8, 4.1, 4.6, 4.2]
To filter based on a range of ratings, you can use the query
method as follows:
result = df.query('4.0 <= rating <= 4.5') print(result)
Output
user_id monthly_usage_gb connections rating 0 101 20 5 4.5 2 103 30 3 4.1 4 105 25 4 4.2
Querying Date Ranges
If our DataFrame includes a column with date information stored as strings, you’ll need to convert this to DateTime objects to perform date-based queries effectively.
Here’s how to add a ‘last_payment_date’ column with date strings to our existing DataFrame:
df['last_payment_date'] = ['2022-01-15', '2022-02-01', '2022-01-22', '2022-01-30', '2022-02-05']
To convert these strings to DateTime objects:
df['last_payment_date'] = pd.to_datetime(df['last_payment_date'])
Once you have DateTime objects, you can query the DataFrame for rows falling within a specific date range.
For example, to find all the records with ‘last_payment_date’ in January 2022:
result = df.query("'2022-01-01' <= last_payment_date <= '2022-01-31'") print(result)
Output
user_id monthly_usage_gb connections rating last_payment_date 0 101 20 5 4.5 2022-01-15 2 103 30 3 4.1 2022-01-22 3 104 40 7 4.6 2022-01-30
Using Date Offsets
Pandas also allows you to use date offsets for more dynamic querying.
For instance, you can find all records where the ‘last_payment_date’ was within the last 7 days:
from pandas.tseries.offsets import DateOffset today = pd.Timestamp('2022-02-06') # Simulating today's date seven_days_ago = today - DateOffset(days=7) result = df.query("@seven_days_ago <= last_payment_date <= @today") print(result)
Output
user_id monthly_usage_gb connections rating last_payment_date 1 102 50 10 3.8 2022-02-01 3 104 40 7 4.6 2022-01-30 4 105 25 4 4.2 2022-02-05
The output DataFrame shows the rows where the last_payment_date
falls within the last 7 days, based on the simulated current date.
Using and, or, not to Combine Range Queries
You can use logical operators and
, or
, and not
to combine multiple range queries
Using (and)
Let’s filter the DataFrame for rows where the monthly_usage_gb
is between 20 and 40, and the last_payment_date
is in January 2022.
result = df.query("(20 <= monthly_usage_gb <= 40) and ('2022-01-01' <= last_payment_date <= '2022-01-31')") print(result)
Output
user_id monthly_usage_gb connections rating last_payment_date 0 101 20 5 4.5 2022-01-15 2 103 30 3 4.1 2022-01-22 3 104 40 7 4.6 2022-01-30
Using (or)
The or
operator lets you select rows that satisfy at least one of multiple conditions.
For example, to get rows where either the monthly_usage_gb
is between 20 and 40 or the rating
is above 4.5:
result = df.query("(20 <= monthly_usage_gb <= 40) or (rating > 4.5)") print(result)
Output
user_id monthly_usage_gb connections rating last_payment_date 0 101 20 5 4.5 2022-01-15 2 103 30 3 4.1 2022-01-22 3 104 40 7 4.6 2022-01-30 4 105 25 4 4.2 2022-02-05
Using (not)
The not
operator can be used to filter out rows that meet a particular condition. Let’s say you want to find all records where the monthly_usage_gb
is not between 30 and 40:
result = df.query("not (30 <= monthly_usage_gb <= 40)") print(result)
Output
user_id monthly_usage_gb connections rating last_payment_date 0 101 20 5 4.5 2022-01-15 1 102 50 10 3.8 2022-02-01 4 105 25 4 4.2 2022-02-05
The output DataFrame excludes rows where monthly_usage_gb
is between 30 and 40.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.