Filter NumPy array by Regular Expressions (Regex)

In this tutorial, we’ll learn how to filter NumPy arrays using regular expressions, including matching patterns at the start or end of strings, using character classes and ranges, and handling case sensitivity.

We’ll work with lookaheads and lookbehinds, and deal with special characters in strings.

 

 

Match the Start or End

First, import the necessary modules:

import numpy as np
import re

Now, let’s consider a sample dataset:

data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456'])

Suppose you want to filter this array to find elements that start with ‘TX’. Here’s how you can do it:

pattern = re.compile(r'^TX')
filtered_data = np.array([item for item in data if pattern.match(item)])
print(filtered_data)

Output:

['TX123' 'TX789' 'TX456']

The output shows the elements of the original array that start with ‘TX’.

The ^ in the regex pattern anchors the match to the start of the string. Because pattern.match already begins matching at the start, the ^ is redundant here, but it makes the intent explicit.

Now, let’s filter the same array for elements that end with ‘123’:

pattern = re.compile(r'123$')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['TX123' 'AB123']

This time, the $ in the regex pattern denotes the end of the string.
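The two filters above can also be expressed as a boolean mask, which keeps the string dtype of the original array even when nothing matches (a list comprehension over an empty result would produce a float array). The following is a minimal sketch; regex_mask is a hypothetical helper name used for illustration, not a NumPy function.

```python
import re
import numpy as np

def regex_mask(arr, pattern, flags=0):
    """Return a boolean array: True where `pattern` is found in the element.

    `regex_mask` is a hypothetical helper for illustration, not part of NumPy.
    """
    compiled = re.compile(pattern, flags)
    # search() returns a Match object or None; turn that into True/False
    return np.array([compiled.search(str(item)) is not None for item in arr],
                    dtype=bool)

data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456'])

starts_with_tx = data[regex_mask(data, r'^TX')]
ends_with_123 = data[regex_mask(data, r'123$')]
no_match = data[regex_mask(data, r'^ZZ')]

print(starts_with_tx)   # ['TX123' 'TX789' 'TX456']
print(ends_with_123)    # ['TX123' 'AB123']
print(no_match.dtype)   # still a string dtype, even though the result is empty
```

The mask can also be reused, for example to count matches with mask.sum() or to select the non-matching elements with data[~mask].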

 

Match a Range of Characters

Imagine you have a NumPy array like this:

data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456', 'CD789', 'EF123'])

Now, suppose you want to filter this array to find elements that contain at least one digit in a specific range, say 4 to 6:

pattern = re.compile(r'[4-6]')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['RX456' 'TX456']

In this example, [4-6] is a character class that matches any one of the digits 4, 5, or 6.

You can also use character classes to match specific groups of letters.

For instance, if you want to find all elements that start with either ‘T’ or ‘R’, the code would look like this:

pattern = re.compile(r'^[TR]')
filtered_data = np.array([item for item in data if pattern.match(item)])
print(filtered_data)

Output:

['TX123' 'RX456' 'TX789' 'TX456']

Here, the regex pattern ^[TR] matches any string in the array that starts with either ‘T’ or ‘R’.

The ^ ensures the match is at the start of the string, and [TR] is a character class that includes ‘T’ and ‘R’.
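Character classes can be chained, one per position, and a ^ placed inside the brackets negates the class instead of anchoring the match. The sketch below reuses the sample array from this section; the specific patterns are illustrative choices, not part of the original examples.

```python
import re
import numpy as np

data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456', 'CD789', 'EF123'])

# First letter A-E, second letter any uppercase letter, then a digit 1-4
chained = re.compile(r'^[A-E][A-Z][1-4]')
chained_result = np.array([item for item in data if chained.match(item)])
print(chained_result)  # ['AB123' 'EF123']

# Inside brackets, a leading ^ means "any character except these"
not_t = re.compile(r'^[^T]')
not_t_result = np.array([item for item in data if not_t.match(item)])
print(not_t_result)  # ['RX456' 'AB123' 'CD789' 'EF123']
```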

 

Ignore Case Sensitivity

Consider an array where the case of characters is inconsistent:

data = np.array(['tx123', 'RX456', 'Tx789', 'ab123', 'TX456', 'cd789', 'EF123'])

Suppose you want to find elements that start with ‘TX’, regardless of whether the letters are lower case, upper case, or a mix of both. Here’s how to do it:

pattern = re.compile(r'tx', re.IGNORECASE)
filtered_data = np.array([item for item in data if pattern.match(item)])
print(filtered_data)

Output:

['tx123' 'Tx789' 'TX456']

The re.IGNORECASE flag passed to re.compile makes the matching process case insensitive.

As a result, ‘tx’, ‘Tx’, ‘tX’, and ‘TX’ are all considered matches.
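Note that pattern.match only checks the beginning of each string, while pattern.search looks for the pattern anywhere. The difference does not show up in the array above, so this sketch uses a hypothetical sample in which ‘tx’ also appears in the middle of an element.

```python
import re
import numpy as np

# Hypothetical sample: 'abTx9' contains 'Tx' but does not start with it
codes = np.array(['tx123', 'RX456', 'abTx9', 'TX456'])
pattern = re.compile(r'tx', re.IGNORECASE)

# match() anchors at the start of the string
at_start = np.array([c for c in codes if pattern.match(c)])
# search() scans the whole string
anywhere = np.array([c for c in codes if pattern.search(c)])

print(at_start)  # ['tx123' 'TX456']
print(anywhere)  # ['tx123' 'abTx9' 'TX456']
```

Use match when the position matters and search when you only care whether the text occurs at all.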

 

Positive and Negative Lookaheads/Lookbehinds

Lookaheads and lookbehinds in regex are useful for matching patterns based on what does or does not follow or precede a certain pattern.

Suppose you have the following array of data:

data = np.array(['TX123A', 'RX456B', 'TX789A', 'AB123C', 'TX456B', 'CD789A', 'EF123B'])

Positive Lookahead

Imagine you want to select elements that start with ‘TX’ and are followed somewhere by ‘A’. This is where a positive lookahead comes into play:

pattern = re.compile(r'TX(?=.*A)')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['TX123A' 'TX789A']

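Here, (?=.*A) is a positive lookahead that asserts an ‘A’ appears somewhere after ‘TX’. Because the pattern has no ^ anchor and is used with search, ‘TX’ could be found anywhere in the string; in this dataset it only occurs at the start, so the output includes the elements that start with ‘TX’ and contain an ‘A’ later on.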
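A lookahead only inspects the text that follows; it does not consume it. Looking at the match object for a single string makes this visible, as in this small sketch:

```python
import re

pattern = re.compile(r'TX(?=.*A)')

m = pattern.search('TX123A')
print(m.group())  # TX
print(m.span())   # (0, 2)

# No 'A' after 'TX', so the lookahead fails and there is no match
print(pattern.search('TX456B'))  # None
```

The matched text is just ‘TX’; the ‘A’ was required to be present but is not part of the match.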

Negative Lookahead

Now, let’s filter for elements that start with ‘TX’ but do not have ‘B’ following them anywhere:

pattern = re.compile(r'TX(?!.*B)')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['TX123A' 'TX789A']

The (?!.*B) is a negative lookahead, ensuring that ‘TX’ is not followed by ‘B’ in the string.

Positive Lookbehind

Next, we’ll use a positive lookbehind to find elements that end with ‘A’ where that ‘A’ is immediately preceded by ‘123’:

pattern = re.compile(r'(?<=123)A$')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['TX123A']

Here, (?<=123) is a positive lookbehind. The full pattern (?<=123)A$ matches an ‘A’ at the end of the string only if it is immediately preceded by ‘123’.

Negative Lookbehind

Finally, let’s find elements ending with ‘A’ but not preceded by ‘789’:

pattern = re.compile(r'(?<!789)A$')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['TX123A']

Here, (?<!789) is a negative lookbehind. The full pattern (?<!789)A$ matches an ‘A’ at the end of the string only if it is not immediately preceded by ‘789’.
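One limitation to keep in mind: Python’s re module only accepts lookbehinds that match a fixed number of characters. The sketch below shows a variable-width lookbehind being rejected and a fixed-width alternative; the \d{3} pattern is an illustrative choice that fits this sample data.

```python
import re
import numpy as np

data = np.array(['TX123A', 'RX456B', 'TX789A', 'AB123C', 'TX456B', 'CD789A', 'EF123B'])

# \d+ can match any number of digits, so re refuses it inside a lookbehind
try:
    re.compile(r'(?<=\d+)A$')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print('rejected:', exc)

# A fixed width (exactly three digits) is accepted
pattern = re.compile(r'(?<=\d{3})A$')
filtered = np.array([item for item in data if pattern.search(item)])
print(filtered)  # ['TX123A' 'TX789A' 'CD789A']
```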

 

Filtering with Special Characters

Special characters in regex, such as . (dot), * (asterisk), ? (question mark), and others, have specific meanings.

However, sometimes you need to match these characters literally in your data.

Consider an array that includes special characters:

data = np.array(['TX1.23', 'RX4*56', 'TX7?89', 'AB1$23', 'TX4%56', 'CD7^89', 'EF1&23'])

Filtering Literal Dots

To filter elements containing a literal dot (.), you need to escape the dot in your pattern:

pattern = re.compile(r'TX1\.23')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['TX1.23']

Here, \. matches a literal dot, differentiating it from its usual meaning in regex, which is to match any character.
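To see why the escape matters, compare an unescaped dot with an escaped one on the same array. The shorter pattern 1.2 is an illustrative choice that makes the difference visible:

```python
import re
import numpy as np

data = np.array(['TX1.23', 'RX4*56', 'TX7?89', 'AB1$23', 'TX4%56', 'CD7^89', 'EF1&23'])

unescaped = re.compile(r'1.2')   # dot matches any single character
escaped = re.compile(r'1\.2')    # dot matches only a literal '.'

loose = np.array([item for item in data if unescaped.search(item)])
strict = np.array([item for item in data if escaped.search(item)])

print(loose)   # ['TX1.23' 'AB1$23' 'EF1&23']
print(strict)  # ['TX1.23']
```

The unescaped pattern also picks up ‘AB1$23’ and ‘EF1&23’, because ‘$’ and ‘&’ each count as "any character".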

Filtering Literal Asterisks

Similarly, to match a literal asterisk (*):

pattern = re.compile(r'RX4\*56')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['RX4*56']

The \* in the pattern ensures that the asterisk is treated as a literal character, not as a quantifier in regex.

Escaping Other Special Characters

This method applies to other special characters as well. For instance, to match a literal question mark (?):

pattern = re.compile(r'TX7\?89')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)

Output:

['TX7?89']

Here, \? helps in matching a literal question mark.
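When the text to match comes from a variable or user input, escaping by hand is error-prone. The standard library’s re.escape adds the backslashes for you. Here is a minimal sketch; contains_literal is a hypothetical helper name used only for this example.

```python
import re
import numpy as np

data = np.array(['TX1.23', 'RX4*56', 'TX7?89', 'AB1$23', 'TX4%56', 'CD7^89', 'EF1&23'])

def contains_literal(arr, text):
    """Return the elements of `arr` that contain `text` literally.

    `contains_literal` is a hypothetical helper for illustration.
    """
    # re.escape neutralises any regex metacharacters in `text`
    pattern = re.compile(re.escape(text))
    mask = np.array([pattern.search(item) is not None for item in arr], dtype=bool)
    return arr[mask]

print(contains_literal(data, '4*5'))  # ['RX4*56']
print(contains_literal(data, '1$2'))  # ['AB1$23']
print(contains_literal(data, '7^8'))  # ['CD7^89']
```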
