5 Ways to Limit Results from Pandas query Method

The Pandas query method allows you to filter data based on conditions.

Sometimes you only need a subset or limited number of results.

In this tutorial, you’ll learn how to limit results from the query method.

 

 

Using head() and tail()

One straightforward way to limit the number of rows returned after using the query method is by chaining it with head() or tail().

These methods allow you to grab the first or last ‘n’ rows of a DataFrame, respectively.

You want to find out the first 3 customers who have exceeded a call duration of 20 minutes.

import pandas as pd
data = {
    'customer_id': [1, 2, 3, 4, 5, 6],
    'call_duration': [12, 25, 18, 22, 33, 28],
    'state': ['NY', 'CA', 'FL', 'TX', 'CA', 'NY']
}
df = pd.DataFrame(data)
result = df.query('call_duration > 20').head(3)
print(result)

Output:

   customer_id  call_duration state
1            2             25    CA
3            4             22    TX
4            5             33    CA

You can also get the last 3 customers who have exceeded a call duration of 20 minutes:

result_tail = df.query('call_duration > 20').tail(3)
print(result_tail)

Output:

   customer_id  call_duration state
3            4             22    TX
4            5             33    CA
5            6             28    NY

 

Limiting Rows with DataFrame.iloc[]

Another way to limit the number of rows after performing a query is using the iloc[] property.

For instance, let’s say you want to get customers positioned from index 2 to 4 who have exceeded a call duration of 20 minutes.

result = df.query('call_duration > 20').iloc[2:5]
print(result)

Output:

   customer_id  call_duration state
4            5             33    CA
5            6             28    NY

 

Limiting Columns with DataFrame.loc[]

Sometimes, you may want not only to limit rows but also select only a subset of columns for your analysis.

Suppose you are interested in only the customer_id and state columns for customers who have a call duration of more than 20 minutes. Here’s how you can achieve this.

result = df.query('call_duration > 20').loc[:, ['customer_id', 'state']]
print(result)

Output:

   customer_id state
1            2    CA
3            4    TX
4            5    CA
5            6    NY

In this example, the loc[:, ['customer_id', 'state']] piece of the code specifies that you want all rows (represented by :) but only the customer_id and state columns from the DataFrame that satisfies the query condition.

 

Return Percentage of Results

The sample method allows you to limit the output based on a percentage of the total results rather than a fixed number of rows.

For example, let’s say you want to get a random 50% of customers who have a call duration greater than 20 minutes.

result = df.query('call_duration > 20').sample(frac=0.5)
print(result)

Output:

Note: The output may vary each time you run the code.
   customer_id  call_duration state
4            5             33    CA
5            6             28    NY

In this output, the sample(frac=0.5) method returns random 50% of the rows that meet the condition call_duration > 20.

 

Using the nlargest() and nsmallest()

You can use the nlargest() and nsmallest() methods if you want to get the ‘n’ largest or smallest values, respectively, based on a specific column.

For instance, to find the top 3 customers with the longest call duration:

# Using query with nlargest
result = df.query('call_duration > 20').nlargest(3, 'call_duration')
print(result)

Output:

   customer_id  call_duration state
4            5             33    CA
5            6             28    NY
1            2             25    CA

Similarly, to find the bottom 3 customers who barely exceed a 20-minute call duration:

# Using query with nsmallest
result = df.query('call_duration > 20').nsmallest(3, 'call_duration')
print(result)

Output:

   customer_id  call_duration state
3            4             22    TX
1            2             25    CA
5            6             28    NY

The nlargest() and nsmallest() methods provide an effective way to isolate the highest or lowest values in your dataset.

Leave a Reply

Your email address will not be published. Required fields are marked *