DataFrame Filtering: Using Regex in Pandas Query

The query() method in Pandas allows you to filter DataFrame rows based on a query expression.

Combining the power of regular expressions (regex) with the query() method allows for more advanced and flexible querying.

This tutorial will guide you through the process of using regex with the query() method in Pandas.

 

 

Match the Start or End of Strings

You can use the following characters to match the start or end of a string:

  1. ^: Represents the start of a string.
  2. $: Represents the end of a string.

Matching the Beginning

First, let’s create a sample DataFrame:

import pandas as pd
data = {
    'item_name': ['apple', 'banana', 'cherry', 'pineapple', 'apple pie', 'banana split']
}
df = pd.DataFrame(data)

Let’s discover items that initiate with the substring ‘apple’.

start_with_apple = df.query("item_name.str.contains('^apple')")
print(start_with_apple)

Output:

   item_name
0      apple
4  apple pie

Here, the pattern ^apple is used, ensuring the selection of only those items that commence with ‘apple’.

Matching the End

Now, to identify items ends with ‘apple’.

end_with_apple = df.query("item_name.str.contains('apple$')")
print(end_with_apple)

Output:

   item_name
0      apple
3  pineapple

This ensures the extraction of items that conclude with ‘apple’.

 

Basic regex patterns

Regular expressions come with foundational patterns and metacharacters to empower your text matching tasks:

  1. . (Dot): Matches any single character except a new line.
  2. * (Asterisk): Matches zero or more repetitions of the preceding character or group.

Let’s illustrate these patterns.

# Query to find item names that contain exactly five characters
five_chars = df.query("item_name.str.contains('^.....$')")
print(five_chars)

Output:

  item_name
0     apple

In this query, ..... matches any item name with five characters anywhere within the string.

For our next example, let’s identify items containing any set of characters followed by ‘apple’.

pattern_start = df.query("item_name.str.contains('.*apple')")
print(pattern_start)

Output:

   item_name
0      apple
3  pineapple
4  apple pie

The .* translates to “zero or more of any character”.

 

Character classes and Ranges

Character classes are depicted using square brackets []. For instance, [abc] would match any single character that is either ‘a’, ‘b’, or ‘c’.

Let’s identify items that contain either the letter ‘a’ or ‘p’ followed immediately by the letter ‘i’.

char_class_query = df.query("item_name.str.contains('[ap]i')")
print(char_class_query)

Output:

   item_name
3  pineapple
4  apple pie

In this query, the pattern [ap]i captures items that have either ‘ai’ or ‘pi’ in their names.

Ranges

Ranges are a subset of character classes. Instead of listing individual characters, you can specify a range using a hyphen. For instance, [a-c] would match any single character that is ‘a’, ‘b’, or ‘c’.

To illustrate, let’s extract items with names that contain any lowercase letter between ‘a’ and ‘c’ followed by the letter ‘e’.

range_query = df.query("item_name.str.contains('[a-h]e')")
print(range_query)

Output:

  item_name
2    cherry

The pattern [a-h]e matches items that contain substrings like ‘ae’, ‘be’, ‘ce’, ‘de’, ‘ee’, ‘fe’, ‘ge, and ‘he’.

 

Excluding rows based on certain patterns

Using character classes in conjunction with the ^ character (inside square brackets) enables you to negate specific characters.

For instance, if you wish to exclude items containing the letters ‘a’ or ‘p’:

exclude_chars = df.query("not item_name.str.contains('[ap]')")
print(exclude_chars)

Output:

  item_name
2    cherry

The pattern not item_name.str.contains('[ap]') ensures rows with the letters ‘a’ or ‘p’ are excluded from the result.

Excluding Specific Patterns

To exclude rows based on particular patterns, use the not keyword alongside your regex pattern within the query() method.

Say, you want to exclude items that end with ‘apple’:

exclude_pattern = df.query("not item_name.str.contains('apple$')")
print(exclude_pattern)

Output:

      item_name
1        banana
2        cherry
4     apple pie
5  banana split

 

Ignore Case Sensitivity in Regex Queries

By default, the search will differentiate between lowercase and uppercase characters.

To perform a case-insensitive search, you can use the case parameter.

Here’s how to search for ‘apple’ in a case-insensitive way:

import re
case_insensitive_search = df.query('item_name.str.contains("APPLE", case=False)')
print(case_insensitive_search)

Output:

   item_name
0      apple
3  pineapple
4  apple pie

Using the above code, the search will match both ‘apple’ and ‘pineapple’ (and any other string containing ‘apple’ regardless of case).

 

Positive and Negative Lookaheads/Lookbehinds

A positive lookahead asserts what is directly following the current position in the string but doesn’t consume any characters. The syntax is (?=...).

Find Text followed by a Space

For example, to find items that have ‘apple’ followed by a space:

positive_lookahead_query = df.query("item_name.str.contains('apple(?= )')")
print(positive_lookahead_query)

Output:

   item_name
4  apple pie

The pattern ‘apple(?= )’ matches the term ‘apple’ only when it’s directly followed by a space.

Find Text not followed by a space

The syntax is (?!...).

To find items with ‘apple’ not followed by a space:

negative_lookahead_query = df.query("item_name.str.contains('apple(?! )')")
print(negative_lookahead_query)

Output:

   item_name
0      apple
3  pineapple

Find Text with space before it

A positive lookbehind asserts what is directly preceding the current position in the string but, again, doesn’t consume characters. The syntax is (?<=...).

To find items that have a space preceding the word ‘pie’:

positive_lookbehind_query = df.query("item_name.str.contains('(?<= )pie')")
print(positive_lookbehind_query)

Output:

   item_name
4  apple pie

The pattern ‘(?<= )pie’ matches the term ‘pie’ only when it’s directly preceded by a space.

Find Text without space before it

A negative lookbehind asserts that a certain string is not directly preceding the current position. The syntax is (?<!...).

To find items with ‘pie’ not preceded by a space:

negative_lookbehind_query = df.query("item_name.str.contains('(?<! )pi')")
print(negative_lookbehind_query)

Output:

   item_name
3  pineapple

In this case, only ‘pineapple’ contains the term ‘pi’ not preceded by a space.

 

Filtering based on patterns with special characters

To match special characters such as $.^*+?{}[]\()|, you must “escape” them using a backslash \.

For instance, if you wish to find items with the dollar sign $:

import pandas as pd
data = {
    'item_name': ['$apple', 'banana', 'cherry', 'pineapple', 'apple pie', 'banana split']
}
df = pd.DataFrame(data)
special_char_query = df.query("item_name.str.contains('\\$')")
print(special_char_query)

Output:

  item_name
0    $apple

Here, the pattern \\$ ensures that the dollar sign is treated as a literal character and not as a regex metacharacter representing the end of a string.

 

Grouping and Capturing

Groups in regex are created using parentheses (...). They not only group elements of your pattern together but also capture the content matched by the group.

For example, to identify items that start with ‘apple’ followed by any sequence of characters:

grouping_query = df.query("item_name.str.contains('^(apple)(.+)$')")
print(grouping_query)

Output:

   item_name
4  apple pie

Here, ^(apple)(.+)$ has two groups: (apple) and (.+). The first group captures the word ‘apple’, and the second captures any one or more characters.

Capturing Content from Groups

To extract specific captured content, you’d typically use the str.extract() method:

captured_content = df['item_name'].str.extract('^(apple)(.*)$')
print(captured_content)

Output:

       0     1
0  apple      
1    NaN   NaN
2    NaN   NaN
3    NaN   NaN
4  apple   pie
5    NaN   NaN

The DataFrame captured_content has two columns, each representing a capturing group from the regex pattern.

Rows with matches display the captured content, while others show NaN.

Non-Capturing Groups

There are situations where you want to group elements without capturing the matched content.

This can be done using the syntax (?:...).

For example, to group ‘apple’ or ‘banana’ but without capturing:

non_capturing_group_query = df.query("item_name.str.contains('^(?:apple|banana)')")
print(non_capturing_group_query)

Output:

      item_name
0        apple
1       banana
4     apple pie
5  banana split

The pattern ^(?:apple|banana) groups ‘apple’ and ‘banana’ for the alternation but doesn’t capture their matches.

 

Backreferences

A backreference is represented by \n where n is the number corresponding to the captured group you’re referring to. For instance, \1 refers to the first capturing group.

To demonstrate, let’s find items with repeated words, like “apple apple”:

import pandas as pd
data = {
    'item_name': ['apple apple', 'banana', 'cherry', 'pineapple', 'apple pie', 'banana split']
}
df = pd.DataFrame(data)
backreference_query = df.query(r"item_name.str.contains('^(\\w+) \\1$')")
print(backreference_query)

Output:

    item_name
0  apple apple

In the pattern ^(\\w+) \\1$, the first group (\\w+) captures a sequence of word characters.

The backreference \\1 then refers to the content matched by this group, ensuring the same sequence appears again.

Leave a Reply

Your email address will not be published. Required fields are marked *