Filter NumPy array by Regular Expressions (Regex)
In this tutorial, we’ll learn how to filter NumPy arrays using regular expressions, including matching patterns at the start or end of strings, using character classes and ranges, and handling case sensitivity.
We’ll work with lookaheads and lookbehinds, and deal with special characters in strings.
Match the Start or End
First, import the necessary modules:
import numpy as np
import re
Now, let’s consider a sample dataset:
data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456'])
Suppose you want to filter this array to find elements that start with ‘TX’. Here’s how you can do it:
pattern = re.compile(r'^TX')
filtered_data = np.array([item for item in data if pattern.match(item)])
print(filtered_data)
Output:
['TX123' 'TX789' 'TX456']
The output shows the elements of the original array that start with ‘TX’.
The ^ in the regex pattern signifies the start of the string, ensuring that the match must occur at the beginning.
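The list comprehension builds the filtered array directly. A variation, sketched here on the assumption that you may want to reuse the selection (for example, to index a second array of the same length), is to build a boolean mask first and then index with it. The name mask is illustrative and not part of the original example.

```python
import re
import numpy as np

data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456'])
pattern = re.compile(r'^TX')

# One True/False entry per element: True where the pattern matches.
mask = np.array([pattern.search(item) is not None for item in data])

# Boolean indexing keeps only the elements whose mask entry is True.
filtered_data = data[mask]
print(mask)
print(filtered_data)
```

Because the pattern itself carries the ^ anchor, search and match select the same elements here.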
Now, let’s filter the same array for elements that end with ‘123’:
pattern = re.compile(r'123$')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['TX123' 'AB123']
This time, the $ in the regex pattern denotes the end of the string.
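Note that this example calls search rather than match. The short sketch below shows why: match only attempts a match at the very start of the string, so a pattern that describes the end of the string never succeeds with it. The variable names are illustrative.

```python
import re

pattern = re.compile(r'123$')

# match() only attempts a match at position 0, so '123$' fails on 'TX123'.
match_result = pattern.match('TX123')

# search() scans the whole string and finds '123' at the end.
search_result = pattern.search('TX123')

print(match_result)
print(search_result is not None)
```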
Match a Range of Characters
Imagine you have a NumPy array like this:
data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456', 'CD789', 'EF123'])
Now, suppose you want to filter this array to find elements that contain at least one digit within a specific range, say 4 to 6:
pattern = re.compile(r'[4-6]')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['RX456' 'TX456']
In this example, [4-6] is a character class that matches any one of the digits 4, 5, or 6.
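To see which characters the class actually matches in each element, one option is to collect them with findall. This is only an inspection aid, sketched with an illustrative dictionary named matches; the filtering itself needs only search.

```python
import re
import numpy as np

data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456', 'CD789', 'EF123'])
pattern = re.compile(r'[4-6]')

# Map each element to the list of characters that [4-6] matched in it.
matches = {item: pattern.findall(item) for item in data.tolist()}
for item, found in matches.items():
    print(item, found)
```

Elements with an empty list contain no digit in the range and are the ones the filter drops.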
You can also use character classes to match specific groups of letters.
For instance, if you want to find all elements that start with either ‘T’ or ‘R’, the code would look like this:
pattern = re.compile(r'^[TR]')
filtered_data = np.array([item for item in data if pattern.match(item)])
print(filtered_data)
Output:
['TX123' 'RX456' 'TX789' 'TX456']
Here, the regex pattern ^[TR] matches any string in the array that starts with either ‘T’ or ‘R’.
The ^ ensures the match is at the start of the string, and [TR] is a character class that includes ‘T’ and ‘R’.
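A character class always matches exactly one character. If the prefixes you care about are longer than one character, alternation inside a non-capturing group is one way to express that. The sketch below assumes the prefixes of interest are ‘TX’ and ‘RX’.

```python
import re
import numpy as np

data = np.array(['TX123', 'RX456', 'TX789', 'AB123', 'TX456', 'CD789', 'EF123'])

# (?:TX|RX) matches either two-letter prefix; ^ pins it to the start.
pattern = re.compile(r'^(?:TX|RX)')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
```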
Ignore Case Sensitivity
Consider an array where the case of characters is inconsistent:
data = np.array(['tx123', 'RX456', 'Tx789', 'ab123', 'TX456', 'cd789', 'EF123'])
Suppose you want to filter this array to find elements that contain ‘TX’, regardless of whether it is in lower case, upper case, or a mix of both. Here’s how to do it:
pattern = re.compile(r'tx', re.IGNORECASE)
# search() finds 'tx' anywhere in the string; match() would only check the start
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['tx123' 'Tx789' 'TX456']
The re.IGNORECASE flag passed to re.compile makes the matching process case insensitive.
As a result, ‘tx’, ‘Tx’, ‘tX’, and ‘TX’ are all considered matches.
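The same behaviour can also be requested inside the pattern itself with the inline flag (?i), which can be handy when the pattern is stored as a plain string. The sketch below compares the two spellings; the variable names are illustrative.

```python
import re
import numpy as np

data = np.array(['tx123', 'RX456', 'Tx789', 'ab123', 'TX456', 'cd789', 'EF123'])

# Two ways to ask for case-insensitive matching.
flag_pattern = re.compile(r'tx', re.IGNORECASE)
inline_pattern = re.compile(r'(?i)tx')

with_flag = np.array([item for item in data if flag_pattern.search(item)])
with_inline = np.array([item for item in data if inline_pattern.search(item)])

print(with_flag)
print(with_inline)
```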
Positive and Negative Lookaheads/Lookbehinds
Lookaheads and lookbehinds in regex are useful for matching patterns based on what does or does not follow or precede a certain pattern.
Suppose you have the following array of data:
data = np.array(['TX123A', 'RX456B', 'TX789A', 'AB123C', 'TX456B', 'CD789A', 'EF123B'])
Positive Lookahead
Imagine you want to select elements that start with ‘TX’ and contain an ‘A’ somewhere after it. This is where a positive lookahead comes into play:
pattern = re.compile(r'^TX(?=.*A)')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['TX123A' 'TX789A']
Here, (?=.*A) is a positive lookahead that asserts ‘TX’ must be followed by ‘A’ anywhere in the string. The output includes elements starting with ‘TX’ and having ‘A’ in them.
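A lookahead is a zero-width assertion: it checks what comes next without consuming it. The small sketch below illustrates this by inspecting the match object for one element. Only ‘TX’ is part of the match, even though the lookahead examined the rest of the string.

```python
import re

m = re.search(r'TX(?=.*A)', 'TX123A')

# The lookahead verified that an 'A' follows, but the match covers only 'TX'.
matched_text = m.group()
matched_span = m.span()
print(matched_text, matched_span)
```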
Negative Lookahead
Now, let’s filter for elements that start with ‘TX’ but do not have ‘B’ following them anywhere:
pattern = re.compile(r'^TX(?!.*B)')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['TX123A' 'TX789A']
The (?!.*B) is a negative lookahead, ensuring that ‘TX’ is not followed by ‘B’ in the string.
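When the requirement is that elements start with ‘TX’, a ^ anchor matters if you use search, because search will otherwise accept ‘TX’ at any position. The sketch below compares the two on a small made-up sample; the element ‘ABTX1A’ is hypothetical and does not appear in the tutorial's data.

```python
import re
import numpy as np

# Hypothetical sample: 'ABTX1A' contains 'TX' but does not start with it.
sample = np.array(['TX123A', 'TX456B', 'ABTX1A'])

unanchored = re.compile(r'TX(?!.*B)')
anchored = re.compile(r'^TX(?!.*B)')

loose = np.array([item for item in sample if unanchored.search(item)])
strict = np.array([item for item in sample if anchored.search(item)])

print(loose)
print(strict)
```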
Positive Lookbehind
Next, we’ll use a positive lookbehind to find elements that end with ‘A’ where that ‘A’ is immediately preceded by ‘123’:
pattern = re.compile(r'(?<=123)A$')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['TX123A']
Here, (?<=123) is a positive lookbehind. The full pattern (?<=123)A$ matches an ‘A’ at the end of the string only if it is immediately preceded by ‘123’.
Negative Lookbehind
Finally, let’s find elements ending with ‘A’ but not preceded by ‘789’:
pattern = re.compile(r'(?<!789)A$')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['TX123A']
The (?<!789) in this pattern is a negative lookbehind. It ensures that the ‘A’ at the end of the string is not immediately preceded by ‘789’.
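One practical limitation worth knowing: Python's built-in re module only accepts lookbehinds of a fixed width. The sketch below shows a fixed-width lookbehind compiling normally and a variable-width one being rejected. The flag variable_width_ok is just an illustrative name.

```python
import re

# A fixed-width lookbehind (exactly three characters) compiles fine.
fixed = re.compile(r'(?<=123)A$')

# A variable-width lookbehind such as (?<=\d+) is rejected by the re module.
try:
    re.compile(r'(?<=\d+)A$')
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print(f're.error: {exc}')

print(fixed.search('TX123A') is not None)
```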
Filtering with Special Characters
Special characters in regex, such as . (dot), * (asterisk), ? (question mark), and others, have specific meanings.
However, sometimes you need to match these characters literally in your data.
Consider an array that includes special characters:
data = np.array(['TX1.23', 'RX4*56', 'TX7?89', 'AB1$23', 'TX4%56', 'CD7^89', 'EF1&23'])
Filtering Literal Dots
To filter elements containing a literal dot (.), you need to escape the dot in your pattern:
pattern = re.compile(r'TX1\.23')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['TX1.23']
Here, \. matches a literal dot, differentiating it from its usual meaning in regex, which is to match any character (except a newline, by default).
Filtering Literal Asterisks
Similarly, to match a literal asterisk (*):
pattern = re.compile(r'RX4\*56')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['RX4*56']
The \* in the pattern ensures that the asterisk is treated as a literal character, not as a quantifier in regex.
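If the goal is to find elements containing any one of several special characters, a character class is a compact alternative, since ., * and ? are treated literally inside square brackets. A minimal sketch, assuming you want any element with a dot, an asterisk, or a question mark:

```python
import re
import numpy as np

data = np.array(['TX1.23', 'RX4*56', 'TX7?89', 'AB1$23', 'TX4%56', 'CD7^89', 'EF1&23'])

# Inside [...], the characters . * ? lose their special meaning.
pattern = re.compile(r'[.*?]')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
```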
Escaping Other Special Characters
This method applies to other special characters as well. For instance, to match a literal question mark (?):
pattern = re.compile(r'TX7\?89')
filtered_data = np.array([item for item in data if pattern.search(item)])
print(filtered_data)
Output:
['TX7?89']
Here, \? matches a literal question mark instead of acting as a quantifier that makes the preceding character optional.
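When the text to match comes from a variable or user input, escaping by hand is error-prone. The standard library's re.escape adds the backslashes for you. The sketch below assumes the search text is held in a variable named literal (an illustrative name) and also shows what happens if the escaping is skipped.

```python
import re
import numpy as np

data = np.array(['TX1.23', 'RX4*56', 'TX7?89', 'AB1$23', 'TX4%56', 'CD7^89', 'EF1&23'])

literal = 'TX7?89'

# Unescaped, '7?' means "an optional 7", so the pattern does not match the literal text.
unescaped_hit = re.search(literal, 'TX7?89')

# re.escape() backslash-escapes the special characters in the string.
pattern = re.compile(re.escape(literal))
filtered_data = np.array([item for item in data if pattern.search(item)])

print(unescaped_hit)
print(filtered_data)
```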
Mokhtar is the founder of LikeGeeks.com.