DataFrame Filtering: Using Regex in Pandas Query
The query()
method in Pandas allows you to filter DataFrame rows based on a query expression.
Combining the power of regular expressions (regex) with the query()
method allows for more advanced and flexible querying.
This tutorial will guide you through the process of using regex with the query()
method in Pandas.
- 1 Match the Start or End of Strings
- 2 Basic regex patterns
- 3 Character classes and Ranges
- 4 Excluding rows based on certain patterns
- 5 Ignore Case Sensitivity in Regex Queries
- 6 Positive and Negative Lookaheads/Lookbehinds
- 7 Filtering based on patterns with special characters
- 8 Grouping and Capturing
- 9 Backreferences
Match the Start or End of Strings
You can use the following characters to match the start or end of a string:
^
: Represents the start of a string.$
: Represents the end of a string.
Matching the Beginning
First, let’s create a sample DataFrame:
import pandas as pd data = { 'item_name': ['apple', 'banana', 'cherry', 'pineapple', 'apple pie', 'banana split'] } df = pd.DataFrame(data)
Let’s discover items that initiate with the substring ‘apple’.
start_with_apple = df.query("item_name.str.contains('^apple')") print(start_with_apple)
Output:
item_name 0 apple 4 apple pie
Here, the pattern ^apple
is used, ensuring the selection of only those items that commence with ‘apple’.
Matching the End
Now, to identify items ends with ‘apple’.
end_with_apple = df.query("item_name.str.contains('apple$')") print(end_with_apple)
Output:
item_name 0 apple 3 pineapple
This ensures the extraction of items that conclude with ‘apple’.
Basic regex patterns
Regular expressions come with foundational patterns and metacharacters to empower your text matching tasks:
.
(Dot): Matches any single character except a new line.*
(Asterisk): Matches zero or more repetitions of the preceding character or group.
Let’s illustrate these patterns.
# Query to find item names that contain exactly five characters five_chars = df.query("item_name.str.contains('^.....$')") print(five_chars)
Output:
item_name 0 apple
In this query, .....
matches any item name with five characters anywhere within the string.
For our next example, let’s identify items containing any set of characters followed by ‘apple’.
pattern_start = df.query("item_name.str.contains('.*apple')") print(pattern_start)
Output:
item_name 0 apple 3 pineapple 4 apple pie
The .*
translates to “zero or more of any character”.
Character classes and Ranges
Character classes are depicted using square brackets []
. For instance, [abc]
would match any single character that is either ‘a’, ‘b’, or ‘c’.
Let’s identify items that contain either the letter ‘a’ or ‘p’ followed immediately by the letter ‘i’.
char_class_query = df.query("item_name.str.contains('[ap]i')") print(char_class_query)
Output:
item_name 3 pineapple 4 apple pie
In this query, the pattern [ap]i
captures items that have either ‘ai’ or ‘pi’ in their names.
Ranges
Ranges are a subset of character classes. Instead of listing individual characters, you can specify a range using a hyphen. For instance, [a-c]
would match any single character that is ‘a’, ‘b’, or ‘c’.
To illustrate, let’s extract items with names that contain any lowercase letter between ‘a’ and ‘c’ followed by the letter ‘e’.
range_query = df.query("item_name.str.contains('[a-h]e')") print(range_query)
Output:
item_name 2 cherry
The pattern [a-h]e
matches items that contain substrings like ‘ae’, ‘be’, ‘ce’, ‘de’, ‘ee’, ‘fe’, ‘ge, and ‘he’.
Excluding rows based on certain patterns
Using character classes in conjunction with the ^
character (inside square brackets) enables you to negate specific characters.
For instance, if you wish to exclude items containing the letters ‘a’ or ‘p’:
exclude_chars = df.query("not item_name.str.contains('[ap]')") print(exclude_chars)
Output:
item_name 2 cherry
The pattern not item_name.str.contains('[ap]')
ensures rows with the letters ‘a’ or ‘p’ are excluded from the result.
Excluding Specific Patterns
To exclude rows based on particular patterns, use the not
keyword alongside your regex pattern within the query()
method.
Say, you want to exclude items that end with ‘apple’:
exclude_pattern = df.query("not item_name.str.contains('apple$')") print(exclude_pattern)
Output:
item_name 1 banana 2 cherry 4 apple pie 5 banana split
Ignore Case Sensitivity in Regex Queries
By default, the search will differentiate between lowercase and uppercase characters.
To perform a case-insensitive search, you can use the case parameter.
Here’s how to search for ‘apple’ in a case-insensitive way:
import re case_insensitive_search = df.query('item_name.str.contains("APPLE", case=False)') print(case_insensitive_search)
Output:
item_name 0 apple 3 pineapple 4 apple pie
Using the above code, the search will match both ‘apple’ and ‘pineapple’ (and any other string containing ‘apple’ regardless of case).
Positive and Negative Lookaheads/Lookbehinds
A positive lookahead asserts what is directly following the current position in the string but doesn’t consume any characters. The syntax is (?=...)
.
Find Text followed by a Space
For example, to find items that have ‘apple’ followed by a space:
positive_lookahead_query = df.query("item_name.str.contains('apple(?= )')") print(positive_lookahead_query)
Output:
item_name 4 apple pie
The pattern ‘apple(?= )’ matches the term ‘apple’ only when it’s directly followed by a space.
Find Text not followed by a space
The syntax is (?!...)
.
To find items with ‘apple’ not followed by a space:
negative_lookahead_query = df.query("item_name.str.contains('apple(?! )')") print(negative_lookahead_query)
Output:
item_name 0 apple 3 pineapple
Find Text with space before it
A positive lookbehind asserts what is directly preceding the current position in the string but, again, doesn’t consume characters. The syntax is (?<=...)
.
To find items that have a space preceding the word ‘pie’:
positive_lookbehind_query = df.query("item_name.str.contains('(?<= )pie')") print(positive_lookbehind_query)
Output:
item_name 4 apple pie
The pattern ‘(?<= )pie’ matches the term ‘pie’ only when it’s directly preceded by a space.
Find Text without space before it
A negative lookbehind asserts that a certain string is not directly preceding the current position. The syntax is (?<!...)
.
To find items with ‘pie’ not preceded by a space:
negative_lookbehind_query = df.query("item_name.str.contains('(?<! )pi')") print(negative_lookbehind_query)
Output:
item_name 3 pineapple
In this case, only ‘pineapple’ contains the term ‘pi’ not preceded by a space.
Filtering based on patterns with special characters
To match special characters such as $.^*+?{}[]\()|
, you must “escape” them using a backslash \
.
For instance, if you wish to find items with the dollar sign $
:
import pandas as pd data = { 'item_name': ['$apple', 'banana', 'cherry', 'pineapple', 'apple pie', 'banana split'] } df = pd.DataFrame(data) special_char_query = df.query("item_name.str.contains('\\$')") print(special_char_query)
Output:
item_name 0 $apple
Here, the pattern \\$
ensures that the dollar sign is treated as a literal character and not as a regex metacharacter representing the end of a string.
Grouping and Capturing
Groups in regex are created using parentheses (...)
. They not only group elements of your pattern together but also capture the content matched by the group.
For example, to identify items that start with ‘apple’ followed by any sequence of characters:
grouping_query = df.query("item_name.str.contains('^(apple)(.+)$')") print(grouping_query)
Output:
item_name 4 apple pie
Here, ^(apple)(.+)$
has two groups: (apple)
and (.+)
. The first group captures the word ‘apple’, and the second captures any one or more characters.
Capturing Content from Groups
To extract specific captured content, you’d typically use the str.extract()
method:
captured_content = df['item_name'].str.extract('^(apple)(.*)$') print(captured_content)
Output:
0 1 0 apple 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 apple pie 5 NaN NaN
The DataFrame captured_content
has two columns, each representing a capturing group from the regex pattern.
Rows with matches display the captured content, while others show NaN.
Non-Capturing Groups
There are situations where you want to group elements without capturing the matched content.
This can be done using the syntax (?:...)
.
For example, to group ‘apple’ or ‘banana’ but without capturing:
non_capturing_group_query = df.query("item_name.str.contains('^(?:apple|banana)')") print(non_capturing_group_query)
Output:
item_name 0 apple 1 banana 4 apple pie 5 banana split
The pattern ^(?:apple|banana)
groups ‘apple’ and ‘banana’ for the alternation but doesn’t capture their matches.
Backreferences
A backreference is represented by \n
where n
is the number corresponding to the captured group you’re referring to. For instance, \1
refers to the first capturing group.
To demonstrate, let’s find items with repeated words, like “apple apple”:
import pandas as pd data = { 'item_name': ['apple apple', 'banana', 'cherry', 'pineapple', 'apple pie', 'banana split'] } df = pd.DataFrame(data) backreference_query = df.query(r"item_name.str.contains('^(\\w+) \\1$')") print(backreference_query)
Output:
item_name 0 apple apple
In the pattern ^(\\w+) \\1$
, the first group (\\w+)
captures a sequence of word characters.
The backreference \\1
then refers to the content matched by this group, ensuring the same sequence appears again.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.