Filter Pandas DataFrames Based on String Length in query()

In this tutorial, you’ll learn how to use the query method in Pandas to filter DataFrames based on the exact string length, how to set conditions for minimum, maximum, and a specific range of string lengths.

You’ll learn how to use string length as a criterion to filter your DataFrame.

 

 

Using Exact String Length

Let’s say you want to find all records where the identifier has an exact length of 10 characters.

You can use the .str.len() method within the query statement to specify the string length:

import pandas as pd
data = {
    'ID': ['1234567890', '987654321', '12345', 'ABCDE12345', '1234ABCD5678'],
    'Plan_Type': ['Basic', 'Premium', 'Basic', 'Premium', 'Basic'],
    'Monthly_Charge': [30, 50, 20, 45, 35]
}
df = pd.DataFrame(data)
filtered_df = df.query("ID.str.len() == 10")
print(filtered_df)

Output:

           ID Plan_Type  Monthly_Charge
0  1234567890     Basic              30
3  ABCDE12345   Premium              45

Here, we filter the DataFrame to include only those rows where the ‘ID’ field is exactly 10 characters long.

 

Using Minimum String Length

Suppose you’re looking to filter out identifiers that are at least 8 characters long:

# Filter DataFrame for rows where 'ID' has 8 or more characters
filtered_df_min_length = df.query("ID.str.len() >= 8")
print(filtered_df_min_length)

Output:

             ID Plan_Type  Monthly_Charge
0    1234567890     Basic              30
1     987654321   Premium              50
3    ABCDE12345   Premium              45
4  1234ABCD5678     Basic              35

With ID.str.len() >= 8 inside the query method, the DataFrame is filtered to display only the rows where the ‘ID’ column’s strings are at least 8 characters in length.

 

Using Maximum String Length

In other cases, you might want to focus on strings that do not exceed a certain length.

For example, if you are interested in identifying all user IDs that are shorter than 10 characters:

# Filter DataFrame for rows where 'ID' is less than 10 characters
filtered_df_max_length = df.query("ID.str.len() < 10")
print(filtered_df_max_length)

Output:

         ID Plan_Type  Monthly_Charge
1  987654321   Premium              50
2      12345     Basic              20

In this code snippet, ID.str.len() < 10 is the condition used within the query method to filter the DataFrame.

The resulting output includes only those records with ‘ID’ strings shorter than 10 characters.

 

Using String Length Range

There may be instances where you are interested in a range of string lengths.

Suppose you want to identify records where the string length of identifiers is between 8 and 10 characters.

# Filter DataFrame for rows where 'ID' length is between 8 and 10 characters
filtered_df_range_length = df.query("ID.str.len() >= 8 and ID.str.len() <= 10")
print(filtered_df_range_length)

Output:

           ID Plan_Type  Monthly_Charge
0  1234567890     Basic              30
1   987654321   Premium              50
3  ABCDE12345   Premium              45

By combining two conditions with the logical and operator within the query statement, you can easily filter the DataFrame for rows where the ‘ID’ column has string lengths within the specified range, in this case, 8 to 10 characters inclusive.

 

Using String Length using a variable

There might be times when you’d prefer not to hard-code the string length values directly into the query expression. Instead, you can use a variable within your query.

This is especially beneficial when you have dynamic criteria that may change over time or when you are working with user input.

Let’s take a look at how to implement this approach:

desired_length = 10

# Filter DataFrame using a variable in the query method
filtered_df_variable_length = df.query("ID.str.len() == @desired_length")
print(filtered_df_variable_length)

Output:

           ID Plan_Type  Monthly_Charge
0  1234567890     Basic              30
3  ABCDE12345   Premium              45

In the code above, @desired_length is used within the query method to reference the desired_length variable defined outside the query.

The DataFrame is filtered to include only those records where the ‘ID’ field matches the string length specified by desired_length.

Leave a Reply

Your email address will not be published. Required fields are marked *