Filter DataFrames Using Pandas query method chaining
Method chaining allows you to perform multiple operations in a single expression.
This tutorial focuses on integrating the Pandas query
method into your method chaining workflow to filter DataFrames.
- 1 Using query Method with pipe()
- 2 Method Chaining with .loc[] and .iloc[]
- 3 Using .assign() and Chaining to Add New Columns
- 4 Chaining Query Filters with groupby() and Aggregation
- 5 Chain query with xs() for MultiIndexes
- 6 Chain Custom Functions using apply() and transform()
- 7 Chained with .join() and .merge()
Using query Method with pipe()
The pipe
method in Pandas allows you to chain together multiple operations.
The advantage of using pipe
is its ability to accept user-defined functions.
You can integrate the query
method within a pipe
to perform complex filtering and transformation steps in one go.
Here’s an example:
import pandas as pd df = pd.DataFrame({ 'ID': [101, 102, 103, 104, 105], 'Name': ['John', 'Sara', 'Mike', 'Emily', 'Tom'], 'Age': [34, 45, 22, 29, 31], 'Plan': ['Basic', 'Premium', 'Basic', 'Standard', 'Premium'], 'Amount': [20, 50, 20, 30, 50] }) # Define a function for age filter def filter_by_age(df, min_age, max_age): return df.query(f"{min_age} <= Age <= {max_age}") # Chain multiple operations using .pipe() result = (df .pipe(filter_by_age, min_age=30, max_age=40) .query("Plan == 'Premium'") .sort_values('Amount') ) print(result)
Output:
ID Name Age Plan Amount 4 105 Tom 31 Premium 50
In the output, you’ll notice that the DataFrame has been filtered to include only those records where the age is between 30 and 40, and the plan is ‘Premium’.
Additionally, the records are sorted by the ‘Amount’ column.
Method Chaining with .loc[] and .iloc[]
While .loc[]
and .iloc[]
are primarily used for label-based and integer-based indexing, you can combine them with the query
method for even more powerful filtering.
Let’s see an example to understand this better:
import pandas as pd df = pd.DataFrame({ 'ID': [201, 202, 203, 204, 205], 'Name': ['Anna', 'Sophie', 'Sam', 'Nina', 'Robert'], 'Age': [25, 33, 41, 52, 29], 'Plan': ['Basic', 'Standard', 'Premium', 'Basic', 'Standard'], 'Amount': [10, 30, 45, 15, 30] }) # Chain multiple operations using .query(), .loc[], and .iloc[] result = (df .query("Plan == 'Standard' or Plan == 'Basic'") .loc[:, ['ID', 'Name', 'Plan']] .iloc[1:4] ) print(result)
Output:
ID Name Plan 1 202 Sophie Standard 3 204 Nina Basic
In the output, the DataFrame is first filtered to include only the rows with plans ‘Standard’ or ‘Basic’.
Then, the .loc[]
method selects only the ‘ID’, ‘Name’, and ‘Plan’ columns. Finally, .iloc[1:4]
slices the rows from index 1 to 3.
Using .assign() and Chaining to Add New Columns
The assign
method in Pandas allows you to add one or multiple new columns to a DataFrame.
What makes assign
even more powerful is that it can be integrated into a method chaining operation.
Let’s go through an example to illustrate this:
import pandas as pd df = pd.DataFrame({ 'ID': [401, 402, 403, 404, 405], 'Name': ['George', 'Lucas', 'Grace', 'Sophia', 'Jack'], 'Age': [25, 27, 31, 28, 24], 'Plan': ['Premium', 'Basic', 'Standard', 'Basic', 'Premium'], 'Amount': [45, 20, 30, 20, 45] }) # Add new columns using .assign() in a method chain result = (df .query("Age > 25") .assign( Total_Amount=lambda x: x['Amount'] * 12, Is_Premium=lambda x: x['Plan'] == 'Premium' ) ) print(result)
Output:
ID Name Age Plan Amount Total_Amount Is_Premium 1 402 Lucas 27 Basic 20 240 False 2 403 Grace 31 Standard 30 360 False 3 404 Sophia 28 Basic 20 240 False
The output shows that the DataFrame has been filtered to include only the rows where the age is greater than 25.
Two new columns have been added: Total_Amount
and Is_Premium
. The Total_Amount
column is calculated by multiplying the Amount
by 12, and Is_Premium
indicates whether the plan is ‘Premium’.
Chaining Query Filters with groupby() and Aggregation
You can use .groupby()
in the chain to aggregate your data based on specific conditions.
import pandas as pd df = pd.DataFrame({ 'ID': [501, 502, 503, 504, 505], 'Name': ['Catherine', 'James', 'Harry', 'Irene', 'Oliver'], 'Age': [36, 28, 33, 41, 38], 'Plan': ['Standard', 'Basic', 'Standard', 'Premium', 'Standard'], 'Amount': [30, 20, 30, 45, 30] }) # Chain query filters with .groupby() to aggregate data result = (df .query("Plan != 'Basic'") .groupby('Plan') .agg( Avg_Age=('Age', 'mean'), Total_Amount=('Amount', 'sum') ) .reset_index() ) print(result)
Output:
Plan Avg_Age Total_Amount 0 Premium 41.000000 45 1 Standard 35.666667 90
In the output, the DataFrame is first filtered to exclude rows with the ‘Basic’ plan. Following this, the .groupby()
method aggregates the data by the ‘Plan’ column.
The aggregation calculates the average age and total amount for each plan.
Lastly, .reset_index()
ensures that the output DataFrame has a clean index.
Chain query with xs() for MultiIndexes
The .xs()
method in Pandas allows you to select data across multi-level indexes.
The best part? You can integrate .xs()
in a method chain along with the query
method to perform in-depth, layered filtering.
import pandas as pd arrays = [ ['A', 'A', 'B', 'B', 'C', 'C'], [1, 2, 1, 2, 1, 2] ] index = pd.MultiIndex.from_arrays(arrays, names=('Letters', 'Numbers')) df = pd.DataFrame({ 'ID': [601, 602, 603, 604, 605, 606], 'Name': ['Victor', 'Jenny', 'Lily', 'Ben', 'Claire', 'Dan'], 'Age': [29, 43, 35, 29, 38, 27], 'Amount': [30, 45, 30, 20, 35, 25] }, index=index) # Chain query filters with .xs() for MultiIndexes result = (df .query("Age >= 30") .xs(key=1, level='Numbers', axis=0) .sort_values('Age') ) print(result)
Output:
ID Name Age Amount Letters B 603 Lily 35 30 C 605 Claire 38 35
In this output, the DataFrame is filtered to include only the rows where the age is 30 or above.
Then, the .xs()
method selects the rows that correspond to the first level (‘Numbers’) being 1.
Lastly, the sort_values('Age')
sorts these filtered rows based on age.
Chain Custom Functions using apply() and transform()
The .apply()
and .transform()
methods are used to apply functions across DataFrames or series.
What’s even more appealing is the ability to include these functions in a method chain, making it possible to combine custom logic with other built-in functionalities.
Here’s an example that illustrates how you can use .apply()
and .transform()
in a method chain:
import pandas as pd df = pd.DataFrame({ 'ID': [801, 802, 803, 804, 805], 'Name': ['Emily', 'Chris', 'Grace', 'Ian', 'Kate'], 'Age': [32, 28, 24, 31, 22], 'Plan': ['Standard', 'Basic', 'Premium', 'Standard', 'Basic'], 'Amount': [35, 20, 55, 35, 20] }) # Define a custom function to calculate yearly amount def calculate_yearly_amount(amount): return amount * 12 # Chain queries with .apply() and .transform() for custom functions result = (df .query("Plan != 'Basic'") .assign( Yearly_Amount=lambda x: x['Amount'].apply(calculate_yearly_amount), Is_Adult=lambda x: x['Age'].transform(lambda age: 'Yes' if age >= 21 else 'No') ) ) print(result)
Output:
ID Name Age Plan Amount Yearly_Amount Is_Adult 0 801 Emily 32 Standard 35 420 Yes 2 803 Grace 24 Premium 55 660 Yes 3 804 Ian 31 Standard 35 420 Yes
In the output, the DataFrame is filtered to remove rows with the ‘Basic’ plan.
Then, the .apply()
function is used to add a Yearly_Amount
column, calculated based on the custom function calculate_yearly_amount
.
Similarly, the .transform()
function is used to create a new column Is_Adult
, which is determined by whether the age is 21 or above.
Chained with .join() and .merge()
Pandas offers robust methods like .join()
and .merge()
to join and merge DataFrames.
You can include these methods in a method chain.
Here’s an example showcasing how to use .join()
and .merge()
in a method chain:
import pandas as pd df1 = pd.DataFrame({ 'ID': [901, 902, 903, 904, 905], 'Name': ['Nathan', 'Olivia', 'Liam', 'Sophia', 'Mason'], 'Age': [34, 28, 29, 40, 25], 'Plan': ['Standard', 'Basic', 'Premium', 'Standard', 'Basic'] }) df2 = pd.DataFrame({ 'ID': [901, 902, 903, 906, 907], 'Amount': [35, 20, 50, 40, 45] }) # Chain queries with .join() and .merge() result = (df1 .query("Plan != 'Basic'") .merge(df2, on='ID', how='inner') .assign( Yearly_Amount=lambda x: x['Amount'] * 12 ) ) print(result)
Output:
ID Name Age Plan Amount Yearly_Amount 0 901 Nathan 34 Standard 35 420 1 903 Liam 29 Premium 50 600
In the output, the DataFrame df1
is first filtered to exclude rows with the ‘Basic’ plan.
Following this, it is merged with df2
based on the ‘ID’ column using an inner join.
Finally, a new column Yearly_Amount
is added to the DataFrame, which is calculated as 12 times the ‘Amount’.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.