Pandas query method on MultiIndex (Hierarchical) DataFrame
This tutorial guides you on how to use the query
method with multi-index (also known as hierarchical index) DataFrames.
By the end, you’ll understand how to perform advanced queries on a multi-index DataFrame.
Using query Method on MultiIndex DataFrame
Querying with Single Level Index
Let’s assume you have data on customer behavior metrics such as call duration and data usage.
import pandas as pd data = { "Call Duration": [300, 220, 400, 150, 200], "Data Usage": [2.5, 1.2, 3.0, 0.9, 2.0] } index = pd.MultiIndex.from_tuples( [("New York", 101), ("New York", 102), ("San Francisco", 103), ("San Francisco", 104), ("Chicago", 105)], names=["City", "CustomerID"] ) df = pd.DataFrame(data, index=index) print(df)
Output:
Call Duration Data Usage City CustomerID New York 101 300 2.5 102 220 1.2 San Francisco 103 400 3.0 104 150 0.9 Chicago 105 200 2.0
The DataFrame has multiple indices: “City” and “CustomerID.”
Now, let’s query data for customers from “New York.”
result = df.query("City == 'New York'") print(result)
Output:
Call Duration Data Usage City CustomerID New York 101 300 2.5 102 220 1.2
You directly use the index name (“City”) in the query string.
Querying with Multiple Level Indices
You can filter data based on conditions that span multiple levels of indices using query
:
result = df.query("City == 'San Francisco' and CustomerID == 103") print(result)
Output:
Call Duration Data Usage City CustomerID San Francisco 103 400 3.0
You can combine conditions for multiple index levels by using logical operators like and
or or
.
Using Regex
You can use regex in query
to perform more advanced queries involving string pattern matching.
Let’s assume you want to find customer data for cities whose names start with the letter “N.” You can accomplish this using a regular expression in your query string as follows:
result = df.query("City.str.match('N.*')") print(result)
Output:
Call Duration Data Usage City CustomerID New York 101 300 2.5 102 220 1.2
In this case, the regular expression 'N.*'
matches any city name that starts with “N,” and you get rows for customers based in “New York.”
Another example: let’s find customers who reside in cities whose names contain a vowel followed by a ‘c’ character (e.g., “Chicago”).
result = df.query("City.str.contains('[aeiou]c')") print(result)
Output:
Call Duration Data Usage City CustomerID Chicago 105 200 2.0
Using Variables in Query
You can use variables within your query to make your code more dynamic and reusable.
The query
method in Pandas allows for variable substitution.
Using Scalar Variables
Suppose you need to filter customer data for a specific city, and the name of the city is stored in a variable. You can include that variable in your query like this:
city_to_query = 'New York' result = df.query("City == @city_to_query") print(result)
Output:
Call Duration Data Usage City CustomerID New York 101 300 2.5 102 220 1.2
Here, the @
symbol before the variable name city_to_query
indicates variable substitution.
Using List Variables
If you want to query multiple values, you can use a list variable. Let’s find customer data for cities stored in a list:
cities_to_query = ['New York', 'Chicago'] result = df.query("City in @cities_to_query") print(result)
Output:
Call Duration Data Usage City CustomerID New York 101 300 2.5 102 220 1.2 Chicago 105 200 2.0
Querying Using Multiple Conditions
When working with complex data structures like multi-index DataFrames, you often need to run queries that involve multiple conditions.
Suppose you want to find customer data for those who are either from “New York” with a call duration of more than 250 seconds, or from “San Francisco” with data usage of more than 2.5GB.
Here’s how you can perform this query:
result = df.query("(City == 'New York' and `Call Duration` > 250) or (City == 'San Francisco' and `Data Usage` > 2.5)") print(result)
Output:
Call Duration Data Usage City CustomerID New York 101 300 2.5 San Francisco 103 400 3.0
Notice: we used backticks because the columns’ names have spaces.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.