Filter DataFrames Using Pandas query method chaining

Method chaining allows you to perform multiple operations in a single expression.

This tutorial focuses on integrating the Pandas query method into your method chaining workflow to filter DataFrames.

 

 

Using query Method with pipe()

The pipe method in Pandas allows you to chain together multiple operations.

The advantage of using pipe is its ability to accept user-defined functions.

You can integrate the query method within a pipe to achieve complex filtering and transformation steps in one go.

Here’s an example:

import pandas as pd
df = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105],
    'Name': ['John', 'Sara', 'Mike', 'Emily', 'Tom'],
    'Age': [34, 45, 22, 29, 31],
    'Plan': ['Basic', 'Premium', 'Basic', 'Standard', 'Premium'],
    'Amount': [20, 50, 20, 30, 50]
})

# Define a function for age filter
def filter_by_age(df, min_age, max_age):
    return df.query(f"{min_age} <= Age <= {max_age}")

# Chain multiple operations using .pipe()
result = (df
          .pipe(filter_by_age, min_age=30, max_age=40)
          .query("Plan == 'Premium'")
          .sort_values('Amount')
         )
print(result)

Output:

    ID  Name  Age     Plan  Amount
4  105   Tom   31  Premium      50

In the output, you’ll notice that the DataFrame has been filtered to include only those records where the age is between 30 and 40, and the plan is ‘Premium’.

Additionally, the records are sorted by the ‘Amount’ column.

 

Method Chaining with .loc[] and .iloc[]

While .loc[] and .iloc[] are primarily used for label-based and integer-based indexing, you can combine them with the query method for even more powerful filtering.

Let’s see an example to understand this better:

import pandas as pd
df = pd.DataFrame({
    'ID': [201, 202, 203, 204, 205],
    'Name': ['Anna', 'Sophie', 'Sam', 'Nina', 'Robert'],
    'Age': [25, 33, 41, 52, 29],
    'Plan': ['Basic', 'Standard', 'Premium', 'Basic', 'Standard'],
    'Amount': [10, 30, 45, 15, 30]
})

# Chain multiple operations using .query(), .loc[], and .iloc[]
result = (df
          .query("Plan == 'Standard' or Plan == 'Basic'")
          .loc[:, ['ID', 'Name', 'Plan']]
          .iloc[1:4]
         )
print(result)

Output:

    ID    Name      Plan
1  202  Sophie  Standard
3  204    Nina     Basic

In the output, the DataFrame is first filtered to include only the rows with plans ‘Standard’ or ‘Basic’.

Then, the .loc[] method selects only the ‘ID’, ‘Name’, and ‘Plan’ columns. Finally, .iloc[1:4] slices the rows from index 1 to 3.

 

Using .assign() and Chaining to Add New Columns

The assign method in Pandas allows you to add one or multiple new columns to a DataFrame.

What makes assign even more powerful is that it can be integrated into a method chaining operation.

Let’s go through an example to illustrate this:

import pandas as pd
df = pd.DataFrame({
    'ID': [401, 402, 403, 404, 405],
    'Name': ['George', 'Lucas', 'Grace', 'Sophia', 'Jack'],
    'Age': [25, 27, 31, 28, 24],
    'Plan': ['Premium', 'Basic', 'Standard', 'Basic', 'Premium'],
    'Amount': [45, 20, 30, 20, 45]
})

# Add new columns using .assign() in a method chain
result = (df
          .query("Age > 25")
          .assign(
              Total_Amount=lambda x: x['Amount'] * 12,
              Is_Premium=lambda x: x['Plan'] == 'Premium'
          )
         )
print(result)

Output:

    ID    Name  Age      Plan  Amount  Total_Amount  Is_Premium
1  402   Lucas   27     Basic      20           240       False
2  403   Grace   31  Standard      30           360       False
3  404  Sophia   28     Basic      20           240       False

The output shows that the DataFrame has been filtered to include only the rows where the age is greater than 25.

Two new columns have been added: Total_Amount and Is_Premium. The Total_Amount column is calculated by multiplying the Amount by 12, and Is_Premium indicates whether the plan is ‘Premium’.

 

Chaining Query Filters with groupby() and Aggregation

You can use .groupby() in the chain to aggregate your data based on specific conditions.

import pandas as pd
df = pd.DataFrame({
    'ID': [501, 502, 503, 504, 505],
    'Name': ['Catherine', 'James', 'Harry', 'Irene', 'Oliver'],
    'Age': [36, 28, 33, 41, 38],
    'Plan': ['Standard', 'Basic', 'Standard', 'Premium', 'Standard'],
    'Amount': [30, 20, 30, 45, 30]
})

# Chain query filters with .groupby() to aggregate data
result = (df
          .query("Plan != 'Basic'")
          .groupby('Plan')
          .agg(
              Avg_Age=('Age', 'mean'),
              Total_Amount=('Amount', 'sum')
          )
          .reset_index()
         )
print(result)

Output:

       Plan    Avg_Age  Total_Amount
0   Premium  41.000000            45
1  Standard  35.666667            90

In the output, the DataFrame is first filtered to exclude rows with the ‘Basic’ plan. Following this, the .groupby() method aggregates the data by the ‘Plan’ column.

The aggregation calculates the average age and total amount for each plan.

Lastly, .reset_index() ensures that the output DataFrame has a clean index.

 

Chain query with xs() for MultiIndexes

The .xs() method in Pandas allows you to select data across multi-level indexes.

The best part? You can integrate .xs() in a method chain along with the query method to perform in-depth, layered filtering.

import pandas as pd
arrays = [
    ['A', 'A', 'B', 'B', 'C', 'C'],
    [1, 2, 1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('Letters', 'Numbers'))
df = pd.DataFrame({
    'ID': [601, 602, 603, 604, 605, 606],
    'Name': ['Victor', 'Jenny', 'Lily', 'Ben', 'Claire', 'Dan'],
    'Age': [29, 43, 35, 29, 38, 27],
    'Amount': [30, 45, 30, 20, 35, 25]
}, index=index)

# Chain query filters with .xs() for MultiIndexes
result = (df
          .query("Age >= 30")
          .xs(key=1, level='Numbers', axis=0)
          .sort_values('Age')
         )
print(result)

Output:

          ID    Name  Age  Amount
Letters                          
B        603    Lily   35      30
C        605  Claire   38      35

In this output, the DataFrame is filtered to include only the rows where the age is 30 or above.

Then, the .xs() method selects the rows that correspond to the first level (‘Numbers’) being 1.

Lastly, the sort_values('Age') sorts these filtered rows based on age.

 

Chain Custom Functions using apply() and transform()

The .apply() and .transform() methods are used to apply functions across DataFrames or series.

What’s even more appealing is the ability to include these functions in a method chain, making it possible to combine custom logic with other built-in functionalities.

Here’s an example that illustrates how you can use .apply() and .transform() in a method chain:

import pandas as pd
df = pd.DataFrame({
    'ID': [801, 802, 803, 804, 805],
    'Name': ['Emily', 'Chris', 'Grace', 'Ian', 'Kate'],
    'Age': [32, 28, 24, 31, 22],
    'Plan': ['Standard', 'Basic', 'Premium', 'Standard', 'Basic'],
    'Amount': [35, 20, 55, 35, 20]
})

# Define a custom function to calculate yearly amount
def calculate_yearly_amount(amount):
    return amount * 12

# Chain queries with .apply() and .transform() for custom functions
result = (df
          .query("Plan != 'Basic'")
          .assign(
              Yearly_Amount=lambda x: x['Amount'].apply(calculate_yearly_amount),
              Is_Adult=lambda x: x['Age'].transform(lambda age: 'Yes' if age >= 21 else 'No')
          )
         )
print(result)

Output:

    ID   Name  Age      Plan  Amount  Yearly_Amount Is_Adult
0  801  Emily   32  Standard      35            420      Yes
2  803  Grace   24   Premium      55            660      Yes
3  804    Ian   31  Standard      35            420      Yes

In the output, the DataFrame is filtered to remove rows with the ‘Basic’ plan.

Then, the .apply() function is used to add a Yearly_Amount column, calculated based on the custom function calculate_yearly_amount.

Similarly, the .transform() function is used to create a new column Is_Adult, which is determined by whether the age is 21 or above.

 

Chained with .join() and .merge()

Pandas offers robust methods like .join() and .merge() to join and merge DataFrames.

You can include these methods in a method chain.

Here’s an example showcasing how to use .join() and .merge() in a method chain:

import pandas as pd
df1 = pd.DataFrame({
    'ID': [901, 902, 903, 904, 905],
    'Name': ['Nathan', 'Olivia', 'Liam', 'Sophia', 'Mason'],
    'Age': [34, 28, 29, 40, 25],
    'Plan': ['Standard', 'Basic', 'Premium', 'Standard', 'Basic']
})
df2 = pd.DataFrame({
    'ID': [901, 902, 903, 906, 907],
    'Amount': [35, 20, 50, 40, 45]
})

# Chain queries with .join() and .merge()
result = (df1
          .query("Plan != 'Basic'")
          .merge(df2, on='ID', how='inner')
          .assign(
              Yearly_Amount=lambda x: x['Amount'] * 12
          )
         )
print(result)

Output:

    ID    Name  Age      Plan  Amount  Yearly_Amount
0  901  Nathan   34  Standard      35            420
1  903    Liam   29   Premium      50            600

In the output, the DataFrame df1 is first filtered to exclude rows with the ‘Basic’ plan.

Following this, it is merged with df2 based on the ‘ID’ column using an inner join.

Finally, a new column Yearly_Amount is added to the DataFrame, which is calculated as 12 times the ‘Amount’.

Leave a Reply

Your email address will not be published. Required fields are marked *