Filter Pandas DataFrame using query() with data in range

The query method in Pandas allows you to perform queries on your DataFrame.

In this tutorial, you’ll learn how to filter DataFrame numeric columns by integer ranges, date ranges, or floating-point value ranges and combine logical operators like and/or/not to construct complex query expressions.

 

 

Querying Numeric Data Ranges

In this section, we’ll learn how to query numeric data ranges using integers, floating-point numbers.

Querying Integers Within a Range

First, let’s load some sample data into a DataFrame.

import pandas as pd
data = {
    'user_id': [101, 102, 103, 104, 105],
    'monthly_usage_gb': [20, 50, 30, 40, 25],
    'connections': [5, 10, 3, 7, 4]
}
df = pd.DataFrame(data)

To filter this DataFrame based on a range of monthly usage values, you can use the query method like this:

result = df.query('20 <= monthly_usage_gb <= 40')
print(result)

Output

   user_id  monthly_usage_gb  connections
0      101                20            5
2      103                30            3
3      104                40            7
4      105                25            4

Querying Floating-Point Numbers Within a Range

Suppose we have another column in our DataFrame that stores floating-point numbers, like customer ratings.

# Adding a 'rating' column with floating-point numbers
df['rating'] = [4.5, 3.8, 4.1, 4.6, 4.2]

To filter based on a range of ratings, you can use the query method as follows:

result = df.query('4.0 <= rating <= 4.5')
print(result)

Output

   user_id  monthly_usage_gb  connections  rating
0      101                20            5     4.5
2      103                30            3     4.1
4      105                25            4     4.2

 

Querying Date Ranges

If our DataFrame includes a column with date information stored as strings, you’ll need to convert this to DateTime objects to perform date-based queries effectively.

Here’s how to add a ‘last_payment_date’ column with date strings to our existing DataFrame:

df['last_payment_date'] = ['2022-01-15', '2022-02-01', '2022-01-22', '2022-01-30', '2022-02-05']

To convert these strings to DateTime objects:

df['last_payment_date'] = pd.to_datetime(df['last_payment_date'])

Once you have DateTime objects, you can query the DataFrame for rows falling within a specific date range.

For example, to find all the records with ‘last_payment_date’ in January 2022:

result = df.query("'2022-01-01' <= last_payment_date <= '2022-01-31'")
print(result)

Output

   user_id  monthly_usage_gb  connections  rating last_payment_date
0      101                20            5     4.5       2022-01-15
2      103                30            3     4.1       2022-01-22
3      104                40            7     4.6       2022-01-30

 

Using Date Offsets

Pandas also allows you to use date offsets for more dynamic querying.

For instance, you can find all records where the ‘last_payment_date’ was within the last 7 days:

from pandas.tseries.offsets import DateOffset
today = pd.Timestamp('2022-02-06')  # Simulating today's date
seven_days_ago = today - DateOffset(days=7)
result = df.query("@seven_days_ago <= last_payment_date <= @today")
print(result)

Output

   user_id  monthly_usage_gb  connections  rating last_payment_date
1      102                50           10     3.8        2022-02-01
3      104                40            7     4.6        2022-01-30
4      105                25            4     4.2        2022-02-05

The output DataFrame shows the rows where the last_payment_date falls within the last 7 days, based on the simulated current date.

 

Using and, or, not to Combine Range Queries

You can use logical operators and, or, and not to combine multiple range queries

Using (and)

Let’s filter the DataFrame for rows where the monthly_usage_gb is between 20 and 40, and the last_payment_date is in January 2022.

result = df.query("(20 <= monthly_usage_gb <= 40) and ('2022-01-01' <= last_payment_date <= '2022-01-31')")
print(result)

Output

   user_id  monthly_usage_gb  connections  rating last_payment_date
0      101                20            5     4.5       2022-01-15
2      103                30            3     4.1       2022-01-22
3      104                40            7     4.6       2022-01-30

 

Using (or)

The or operator lets you select rows that satisfy at least one of multiple conditions.

For example, to get rows where either the monthly_usage_gb is between 20 and 40 or the rating is above 4.5:

result = df.query("(20 <= monthly_usage_gb <= 40) or (rating > 4.5)")
print(result)

Output

   user_id  monthly_usage_gb  connections  rating last_payment_date
0      101                20            5     4.5       2022-01-15
2      103                30            3     4.1       2022-01-22
3      104                40            7     4.6       2022-01-30
4      105                25            4     4.2       2022-02-05

 

Using (not)

The not operator can be used to filter out rows that meet a particular condition. Let’s say you want to find all records where the monthly_usage_gb is not between 30 and 40:

result = df.query("not (30 <= monthly_usage_gb <= 40)")
print(result)

Output

   user_id  monthly_usage_gb  connections  rating last_payment_date
0      101                20            5     4.5       2022-01-15
1      102                50           10     3.8       2022-02-01
4      105                25            4     4.2       2022-02-05

The output DataFrame excludes rows where monthly_usage_gb is between 30 and 40.

Leave a Reply

Your email address will not be published. Required fields are marked *