Using String Methods in Pandas query(): DataFrame Filtering
The query()
method in Pandas allows you to filter DataFrames, and when combined with string methods, the possibilities become extensive.
In this tutorial, you’ll learn how to use string methods within the query()
function to refine and filter data.
- 1 Checking substring: str.contains()
- 2 str.startswith() and str.endswith()
- 3 Checking for uppercase: str.isupper()
- 4 Checking for lowercase: str.islower()
- 5 Checking for numeric characters
- 6 Checking for alphanumerics: str.isalnum()
- 7 Checking for decimals: str.isdecimal()
- 8 Checking for whitespace: str.isspace()
- 9 Checking for titlecase: str.istitle()
Checking substring: str.contains()
The str.contains()
method checks if a substring exists within a string and returns a Boolean series.
Let’s dive into an example:
import pandas as pd data = { 'Names': ['Alice Cooper', 'BOB', 'charlie', '123Dylan', 'Elle45', 'FRANK ', ' Georgia ', 'Henry-White', 'isabelle99', 'J123ack', ' karen ', 'LOL123'], 'Codes': ['A1B2', 'XYZ', 'zyx998', '1234', 'W5W5', ' ', 'AB.CD', '99.99', 'HELLO', 'world', 'space ', '123.45'] } df = pd.DataFrame(data) result = df.query('Names.str.contains("Dylan")') print(result)
Output:
Names Codes 3 123Dylan 1234
When using the query()
method combined with str.contains()
, we efficiently filtered out rows containing the substring “Dylan”.
str.startswith() and str.endswith()
You can use the str.startswith()
and str.endswith()
methods to filter rows based on whether certain strings in your DataFrame start or end with a specific substring.
Using str.startswith()
You can use the str.startswith()
method if you want to filter rows where the string begins with a specific substring:
# Filtering names that start with 'ch' result_startswith = df.query('Names.str.startswith("ch")') print(result_startswith)
Output:
Names Codes 2 charlie zyx998
In this code, you’re searching for names that start with “ch”. As a result, “charlie” gets returned.
Using str.endswith()
You can use the str.endswith()
method if you want to filter rows where the string ends with a particular substring:
# Filtering codes that end with '5' result_endswith = df.query('Codes.str.endswith("5")') print(result_endswith)
Output:
Names Codes 4 Elle45 W5W5 11 LOL123 123.45
Here, we get rows with codes ending with “5”.
Checking for uppercase: str.isupper()
The str.isupper()
method allows you to check if a particular string is entirely in uppercase.
# Filtering rows where 'Names' is entirely in uppercase result_uppercase = df.query('Names.str.isupper()') print(result_uppercase[['Names']])
Output:
Names 1 BOB 5 FRANK 11 LOL123
From the given code, the goal is to filter rows in which the ‘Names’ column consists of entirely uppercase characters.
The output showcases entries like “BOB”, “FRANK “, and “LOL123” — all of which are fully uppercase.
Checking for lowercase: str.islower()
The str.islower()
checks if a specific string is completely in lowercase:
# Filtering rows where 'Names' is entirely in lowercase result_lowercase = df.query('Names.str.islower()') print(result_lowercase[['Names']])
Output:
Names 2 charlie 8 isabelle99 10 karen
The output highlights entries such as “charlie”, “isabelle99”, and “karen” – all of which are in lowercase.
Checking for numeric characters
The str.isnumeric()
and str.isdigit()
.methods allow you to check if a string contains purely numeric characters.
Both methods serve to identify strings composed solely of digits, but there are nuances in their behavior, especially when considering Unicode characters.
Using str.isnumeric()
This method checks if all characters in the string are numeric characters and the string is not empty:
import pandas as pd df = pd.DataFrame({ 'data': ['12345', '½', 'abcd', '123a'] }) # Query with str.isnumeric() print(df.query("`data`.str.isnumeric()"))
Output:
data 0 12345 1 ½
Using str.isdigit()
The str.isnumeric()
, method checks if all characters in the string are digits and the string is not empty:
# Query with str.isdigit() print(df.query("`data`.str.isdigit()"))
Output:
data 0 12345
As we can see, the str.isnumeric()
recognized ‘12345’ and ‘½’ as containing numeric characters, while the str.isdigit()
only recognized ‘12345’ as containing digit characters.
Note the use of backticks (“) around the column name, which is needed when the column name collides with a function name or when it contains spaces or special characters.
Checking for alphanumerics: str.isalnum()
The str.isalnum()
method in Pandas allows you to check if a string is composed of alphanumeric characters (both letters and numbers), without any spaces or special characters.
# Filtering rows where 'Names' contains only alphanumeric characters result_alnum = df.query('Names.str.isalnum()') print(result_alnum[['Names']])
Output:
Names 1 BOB 2 charlie 3 123Dylan 4 Elle45 8 isabelle99 9 J123ack 11 LOL123
From the executed code, you can observe that we’re attempting to filter rows wherein the ‘Names’ column consists solely of alphanumeric characters.
Checking for decimals: str.isdecimal()
The str.isdecimal()
checks if the string contains characters from the Unicode decimal category, which essentially includes integers but not floats or any other numeric representations.
# Filtering rows where 'Names' is purely decimal result_decimal = df.query('Codes.str.isdecimal()') print(result_decimal[['Codes']])
Output:
Codes 3 1234
From the code above, we filter rows where the ‘Codes’ column consists purely of decimal characters. In our sample data, the only output is “1234”.
Checking for whitespace: str.isspace()
Identifying strings that are composed entirely of whitespace can be vital for data cleaning and validation processes.
Such strings can often be the result of data entry errors, misaligned imports, or placeholder values. Using the str.isspace()
method in Pandas, you can effortlessly detect these whitespace-only strings.
Let’s illustrate its usage with our sample DataFrame:
# Filtering rows where 'Names' is entirely whitespace result_whitespace = df.query('Codes.str.isspace()') print(result_whitespace[['Names']])
Output:
Names 5 FRANK
From the above code, we filter rows in which the ‘Codes’ column consists only of whitespace.
Then we print the name of that code which is ‘FRANK’.
Checking for titlecase: str.istitle()
The title case, where the first letter of each word is capitalized, is a common format for names, titles, or headers.
You can use str.istitle()
method to check for strings that follow this pattern:
# Filtering rows where 'Names' is in titlecase result_titlecase = df.query('Names.str.istitle()') print(result_titlecase[['Names']])
Output:
Names 0 Alice Cooper 3 123Dylan 4 Elle45 6 Georgia 7 Henry-White
From the executed code, we’re filtering rows where the ‘Names’ column adheres to the title case format.
The output highlights “Alice Cooper” and “Henry-White”. Even though “Henry-White” contains a hyphen, each segmented word starts with a capitalized letter, which aligns with the titlecase structure.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.