Pandas query method on MultiIndex (Hierarchical) DataFrame

This tutorial guides you on how to use the query method with multi-index (also known as hierarchical index) DataFrames.

By the end, you’ll understand how to perform advanced queries on a multi-index DataFrame.

 

 

Using query Method on MultiIndex DataFrame

Querying with Single Level Index

Let’s assume you have data on customer behavior metrics such as call duration and data usage.

import pandas as pd
data = {
    "Call Duration": [300, 220, 400, 150, 200],
    "Data Usage": [2.5, 1.2, 3.0, 0.9, 2.0]
}
index = pd.MultiIndex.from_tuples(
    [("New York", 101), ("New York", 102), ("San Francisco", 103), ("San Francisco", 104), ("Chicago", 105)],
    names=["City", "CustomerID"]
)
df = pd.DataFrame(data, index=index)
print(df)

Output:

                          Call Duration  Data Usage
City          CustomerID                           
New York      101                   300         2.5
              102                   220         1.2
San Francisco 103                   400         3.0
              104                   150         0.9
Chicago       105                   200         2.0

The DataFrame has multiple indices: “City” and “CustomerID.”

Now, let’s query data for customers from “New York.”

result = df.query("City == 'New York'")
print(result)

Output:

                     Call Duration  Data Usage
City     CustomerID                          
New York 101                   300         2.5
         102                   220         1.2

You directly use the index name (“City”) in the query string.

Querying with Multiple Level Indices

You can filter data based on conditions that span multiple levels of indices using query:

result = df.query("City == 'San Francisco' and CustomerID == 103")
print(result)

Output:

                          Call Duration  Data Usage
City          CustomerID                           
San Francisco 103                   400         3.0

You can combine conditions for multiple index levels by using logical operators like and or or.

 

Using Regex

You can use regex in query to perform more advanced queries involving string pattern matching.

Let’s assume you want to find customer data for cities whose names start with the letter “N.” You can accomplish this using a regular expression in your query string as follows:

result = df.query("City.str.match('N.*')")
print(result)

Output:

                     Call Duration  Data Usage
City     CustomerID                           
New York 101                   300         2.5
         102                   220         1.2

In this case, the regular expression 'N.*' matches any city name that starts with “N,” and you get rows for customers based in “New York.”

Another example: let’s find customers who reside in cities whose names contain a vowel followed by a ‘c’ character (e.g., “Chicago”).

result = df.query("City.str.contains('[aeiou]c')")
print(result)

Output:

                    Call Duration  Data Usage
City    CustomerID                           
Chicago 105                   200         2.0

 

Using Variables in Query

You can use variables within your query to make your code more dynamic and reusable.

The query method in Pandas allows for variable substitution.

Using Scalar Variables

Suppose you need to filter customer data for a specific city, and the name of the city is stored in a variable. You can include that variable in your query like this:

city_to_query = 'New York'
result = df.query("City == @city_to_query")
print(result)

Output:

                  Call Duration  Data Usage
City     CustomerID                        
New York 101                300         2.5
         102                220         1.2

Here, the @ symbol before the variable name city_to_query indicates variable substitution.

Using List Variables

If you want to query multiple values, you can use a list variable. Let’s find customer data for cities stored in a list:

cities_to_query = ['New York', 'Chicago']
result = df.query("City in @cities_to_query")
print(result)

Output:

                  Call Duration  Data Usage
City     CustomerID                        
New York 101                300         2.5
         102                220         1.2
Chicago  105                200         2.0

 

Querying Using Multiple Conditions

When working with complex data structures like multi-index DataFrames, you often need to run queries that involve multiple conditions.

Suppose you want to find customer data for those who are either from “New York” with a call duration of more than 250 seconds, or from “San Francisco” with data usage of more than 2.5GB.

Here’s how you can perform this query:

result = df.query("(City == 'New York' and `Call Duration` > 250) or (City == 'San Francisco' and `Data Usage` > 2.5)")
print(result)

Output:

                          Call Duration  Data Usage
City          CustomerID                           
New York      101                   300         2.5
San Francisco 103                   400         3.0

Notice: we used backticks because the columns’ names have spaces.

Leave a Reply

Your email address will not be published. Required fields are marked *