Using String Methods in Pandas query(): DataFrame Filtering

The query() method in Pandas allows you to filter DataFrames, and when combined with string methods, the possibilities become extensive.

In this tutorial, you’ll learn how to use string methods within the query() function to refine and filter data.

 

 

Checking substring: str.contains()

The str.contains() method checks if a substring exists within a string and returns a Boolean series.

Let’s dive into an example:

import pandas as pd
data = {
    'Names': ['Alice Cooper', 'BOB', 'charlie', '123Dylan', 'Elle45', 'FRANK ', '  Georgia  ', 'Henry-White', 'isabelle99', 'J123ack', '  karen  ', 'LOL123'],
    'Codes': ['A1B2', 'XYZ', 'zyx998', '1234', 'W5W5', '   ', 'AB.CD', '99.99', 'HELLO', 'world', 'space  ', '123.45']
}
df = pd.DataFrame(data)
result = df.query('Names.str.contains("Dylan")')
print(result)

Output:

      Names  Codes
3  123Dylan  1234

When using the query() method combined with str.contains(), we efficiently filtered out rows containing the substring “Dylan”.

 

str.startswith() and str.endswith()

You can use the str.startswith() and str.endswith() methods to filter rows based on whether certain strings in your DataFrame start or end with a specific substring.

Using str.startswith()

You can use the str.startswith() method if you want to filter rows where the string begins with a specific substring:

# Filtering names that start with 'ch'
result_startswith = df.query('Names.str.startswith("ch")')
print(result_startswith)

Output:

     Names  Codes
2  charlie  zyx998

In this code, you’re searching for names that start with “ch”. As a result, “charlie” gets returned.

Using str.endswith()

You can use the str.endswith() method if you want to filter rows where the string ends with a particular substring:

# Filtering codes that end with '5'
result_endswith = df.query('Codes.str.endswith("5")')
print(result_endswith)

Output:

         Names   Codes
4        Elle45   W5W5
11       LOL123  123.45

Here, we get rows with codes ending with “5”.

 

Checking for uppercase: str.isupper()

The str.isupper() method allows you to check if a particular string is entirely in uppercase.

# Filtering rows where 'Names' is entirely in uppercase
result_uppercase = df.query('Names.str.isupper()')
print(result_uppercase[['Names']])

Output:

   Names
1   BOB
5  FRANK 
11  LOL123

From the given code, the goal is to filter rows in which the ‘Names’ column consists of entirely uppercase characters.

The output showcases entries like “BOB”, “FRANK “, and “LOL123” — all of which are fully uppercase.

 

Checking for lowercase: str.islower()

The str.islower()checks if a specific string is completely in lowercase:

# Filtering rows where 'Names' is entirely in lowercase
result_lowercase = df.query('Names.str.islower()')
print(result_lowercase[['Names']])

Output:

        Names
2     charlie
8  isabelle99
10     karen  

The output highlights entries such as “charlie”, “isabelle99”, and “karen” – all of which are in lowercase.

 

Checking for numeric characters

The str.isnumeric() and str.isdigit().methods allow you to check if a string contains purely numeric characters.

Both methods serve to identify strings composed solely of digits, but there are nuances in their behavior, especially when considering Unicode characters.

Using str.isnumeric()

This method checks if all characters in the string are numeric characters and the string is not empty:

import pandas as pd
df = pd.DataFrame({
    'data': ['12345', '½', 'abcd', '123a']
})

# Query with str.isnumeric()
print(df.query("`data`.str.isnumeric()"))

Output:

    data
0  12345
1      ½

Using str.isdigit()

The str.isnumeric(), method checks if all characters in the string are digits and the string is not empty:

# Query with str.isdigit()
print(df.query("`data`.str.isdigit()"))

Output:

    data
0  12345

As we can see, the str.isnumeric() recognized ‘12345’ and ‘½’ as containing numeric characters, while the str.isdigit() only recognized ‘12345’ as containing digit characters.

Note the use of backticks (“) around the column name, which is needed when the column name collides with a function name or when it contains spaces or special characters.

 

Checking for alphanumerics: str.isalnum()

The str.isalnum() method in Pandas allows you to check if a string is composed of alphanumeric characters (both letters and numbers), without any spaces or special characters.

# Filtering rows where 'Names' contains only alphanumeric characters
result_alnum = df.query('Names.str.isalnum()')
print(result_alnum[['Names']])

Output:

         Names
1          BOB
2      charlie
3     123Dylan
4       Elle45
8   isabelle99
9      J123ack
11      LOL123

From the executed code, you can observe that we’re attempting to filter rows wherein the ‘Names’ column consists solely of alphanumeric characters.

 

Checking for decimals: str.isdecimal()

The str.isdecimal() checks if the string contains characters from the Unicode decimal category, which essentially includes integers but not floats or any other numeric representations.

# Filtering rows where 'Names' is purely decimal
result_decimal = df.query('Codes.str.isdecimal()')
print(result_decimal[['Codes']])

Output:

  Codes
3  1234

From the code above, we filter rows where the ‘Codes’ column consists purely of decimal characters. In our sample data, the only output is “1234”.

 

Checking for whitespace: str.isspace()

Identifying strings that are composed entirely of whitespace can be vital for data cleaning and validation processes.

Such strings can often be the result of data entry errors, misaligned imports, or placeholder values. Using the str.isspace() method in Pandas, you can effortlessly detect these whitespace-only strings.

Let’s illustrate its usage with our sample DataFrame:

# Filtering rows where 'Names' is entirely whitespace
result_whitespace = df.query('Codes.str.isspace()')
print(result_whitespace[['Names']])

Output:

    Names
5  FRANK  

From the above code, we filter rows in which the ‘Codes’ column consists only of whitespace.

Then we print the name of that code which is ‘FRANK’.

 

Checking for titlecase: str.istitle()

The title case, where the first letter of each word is capitalized, is a common format for names, titles, or headers.

You can use str.istitle() method to check for strings that follow this pattern:

# Filtering rows where 'Names' is in titlecase
result_titlecase = df.query('Names.str.istitle()')
print(result_titlecase[['Names']])

Output:

          Names
0  Alice Cooper
3      123Dylan
4        Elle45
6     Georgia  
7   Henry-White

From the executed code, we’re filtering rows where the ‘Names’ column adheres to the title case format.

The output highlights “Alice Cooper” and “Henry-White”. Even though “Henry-White” contains a hyphen, each segmented word starts with a capitalized letter, which aligns with the titlecase structure.

Leave a Reply

Your email address will not be published. Required fields are marked *