Advanced Regex with Linux find Command

The find command in Linux allows you to search for files and directories within a directory hierarchy based on different criteria.

One powerful feature of find is its ability to search using regular expressions. Regular expressions, or regex for short, provide a method to match a sequence of characters in a string.

 

 

Regular expression syntaxes: BRE and ERE

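The find command supports both basic regular expressions (BRE) and extended regular expressions (ERE), but the default differs by implementation. GNU find, the usual find on Linux, defaults to an Emacs-style syntax and changes syntax with -regextype. BSD and macOS find default to BRE and switch to ERE with -E.

BRE: Uses a more limited set of metacharacters. In POSIX BRE, +, ?, and | are ordinary characters; GNU accepts \+, \?, and \| as extensions. Grouping and intervals are written \( \) and \{ \}.

find /path -regextype posix-basic -regex 'pattern_using_BRE'

ERE: Offers a more extensive set of metacharacters and is more expressive: +, ?, |, ( ), and { } work without backslashes. With GNU find, select it with -regextype posix-extended, placed before -regex. With BSD or macOS find, use -E, placed before the path.

find /path -regextype posix-extended -regex 'pattern_using_ERE'
find -E /path -regex 'pattern_using_ERE'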

Here’s a table that summarizes the difference between BRE and ERE:

Feature              BRE                              ERE
Metacharacters       . * ^ $ [ ] \( \) \{ \}          . * ^ $ [ ] ( ) { } + ? |
Usage with find      GNU: -regextype posix-basic      GNU: -regextype posix-extended
                     BSD/macOS: default for -regex    BSD/macOS: -E option
Grouping             \( \)                            ( )
Alternation          Not in POSIX; \| in GNU          |
Intervals            \{n,m\}                          {n,m}
Literal metachar     Prefix with \                    Prefix with \
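
To see the syntax difference in practice, here is a minimal sketch that runs the same search twice. It assumes GNU find (for -regextype), and the temporary directory and file names are made up for the demonstration.

```shell
# Sketch assuming GNU find; the directory and file names are hypothetical.
demo=$(mktemp -d)
touch "$demo/a.jpg" "$demo/b.png" "$demo/c.txt"

# Default GNU syntax (Emacs-style): backslashed group and alternation.
default_hits=$(cd "$demo" && find . -type f -regex '.*\.\(jpg\|png\)' | LC_ALL=C sort)

# POSIX ERE: bare ( ) and |, selected before the -regex test.
ere_hits=$(cd "$demo" && find . -regextype posix-extended -type f -regex '.*\.(jpg|png)' | LC_ALL=C sort)

printf 'default syntax:\n%s\n' "$default_hits"
printf 'posix-extended:\n%s\n' "$ere_hits"
rm -rf "$demo"
```

Both searches should list the same two image files; only the way the group and the alternation are written changes.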

 

Understanding the full path matching behavior

When you use the -regex option with find, it matches against the entire path, not just the filename. This is essential to remember because your regular expression should consider the path structure.

Let’s assume you have a directory structure that looks something like this:

/path/to/dir/
    |
    |-- fileA.txt
    |-- subdirectory/
    |   |-- fileB.txt

If you execute:

find /path/to/dir -regex '.*/fileA\.txt'

You will successfully match /path/to/dir/fileA.txt.

However, if you execute:

find /path/to/dir -regex 'fileA.txt'

You won’t get any matches. This is because the pattern has to match the whole path, /path/to/dir/fileA.txt, and not just a part of it.

To match based on just the filename, you’d typically combine find with other tools like basename or use other tests like -name. For instance:

find /path/to/dir -name 'fileA.txt'

Would successfully match based on the filename alone.
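
The behavior can be checked with a small throwaway tree. This is only a sketch: the temporary directory is created for the demonstration, and the search runs from inside it so the reported paths start with ./ .

```shell
# Hypothetical demo tree mirroring the layout described above.
demo=$(mktemp -d)
mkdir -p "$demo/subdirectory"
touch "$demo/fileA.txt" "$demo/subdirectory/fileB.txt"

# A bare filename never equals the full path "./fileA.txt", so nothing matches.
bare=$(cd "$demo" && find . -regex 'fileA\.txt')

# Allowing for the leading directories makes the pattern cover the whole path.
full=$(cd "$demo" && find . -regex '.*/fileA\.txt')

# -name only looks at the last path component.
by_name=$(cd "$demo" && find . -name 'fileA.txt')

printf 'bare: [%s]\nfull: [%s]\nname: [%s]\n' "$bare" "$full" "$by_name"
rm -rf "$demo"
```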

 

Understanding special characters

These are some of the most commonly used metacharacters in regular expressions:

  • .: Matches any single character.
  • ^: Asserts the start of the string (with find, the start of the path).
  • $: Asserts the end of the string (with find, the end of the path).
  • *: Matches the previous element zero or more times.
  • +: Matches the previous element one or more times.
  • ?: Matches the previous element zero or one time.
  • \: Escapes the following character, turning any metacharacter into a literal.
  • |: Acts as a logical OR. Matches either the pattern before or the pattern after it.
  • (): Groups multiple patterns into a single unit.

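These metacharacters are the foundation of pattern matching with regex. How they are written depends on the syntax in use: in GNU find's default syntax, + and ? work as shown, while grouping and alternation are written \( \) and \|.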

 

Understanding anchors

Anchors are special characters in regular expressions that denote positions in a string rather than actual content. The two most common anchors are:

  • ^: This denotes the beginning of a line or string.
  • $: This represents the end of a line or string.

Example 1: Finding configuration files that start with “nginx”:

find /etc/ -regex '.*/nginx[^/]*\.conf$'

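This command would locate files like /etc/nginx/nginx.conf or /etc/nginx/sites-available/nginx-default.conf, but not something like /etc/apache2/mimic-nginx.conf, whose name does not begin with “nginx”. Because -regex always has to match the entire path, the trailing $ is optional here.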

Example 2: Locating log files with “2023-08” in their names:

find /var/log/ -regex '.*2023-08[^/]*\.log$'

This command identifies log files such as /var/log/syslog-2023-08-19.log or /var/log/auth-2023-08-20.log.
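
Since -regex has to match the whole path, an explicit $ at the end of the pattern should not change the result. The sketch below checks this on a made-up directory tree; the file names are hypothetical and chosen to resemble the nginx example.

```shell
# Hypothetical tree for the demonstration.
demo=$(mktemp -d)
mkdir -p "$demo/nginx/sites-available" "$demo/apache2"
touch "$demo/nginx/nginx.conf" \
      "$demo/nginx/sites-available/nginx-default.conf" \
      "$demo/apache2/mimic-nginx.conf"

# Same pattern with and without the trailing anchor.
with_anchor=$(cd "$demo" && find . -regex '.*/nginx[^/]*\.conf$' | LC_ALL=C sort)
without_anchor=$(cd "$demo" && find . -regex '.*/nginx[^/]*\.conf' | LC_ALL=C sort)

printf '%s\n' "$with_anchor"
rm -rf "$demo"
```

Both variables should hold the two files whose names begin with nginx; mimic-nginx.conf is skipped because the name must start right after a slash.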

 

Finding files with any single character

In shell globbing, ? matches any single character. In regular expressions that job belongs to the . (dot) character, while ? means zero or one occurrence of the preceding character or group.

Since the two are easy to confuse, let’s cover both.

Matching Any Single Character with dot

The . (dot) in regular expressions is a special character that matches any single character except a newline.

Let’s say you have a directory with the following files:

dir/
    |
    |-- a1.txt
    |-- a2.txt
    |-- a3.txt
    |-- aX.txt
    |-- ab.txt

To find files that have a pattern of “a, any single character, .txt”:

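find dir/ -regex '.*/a.\.txt'

This would match a1.txt, a2.txt, a3.txt, aX.txt, and ab.txt, because each has exactly one character between the “a” and .txt. A name such as abc.txt would not match, since it has two characters there. The pattern starts with .*/ because find reports these files as dir/a1.txt and so on, not as ./a1.txt.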
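
A quick way to confirm what the dot accepts is to try it on a few test files. This is a sketch with hypothetical file names in a temporary directory.

```shell
# Hypothetical files: one, one, and two characters between "a" and ".txt".
demo=$(mktemp -d)
touch "$demo/a1.txt" "$demo/ab.txt" "$demo/abc.txt"

# "." stands for exactly one character, whatever it is.
one_char=$(cd "$demo" && find . -type f -regex '.*/a.\.txt' | LC_ALL=C sort)

printf '%s\n' "$one_char"
rm -rf "$demo"
```

Only a1.txt and ab.txt are listed; abc.txt has one character too many.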

The Use of ? in Regular Expressions

The ? in regular expressions represents zero or one occurrence of the preceding character or group. This can be useful when you’re uncertain about the presence of a character.

If you have files named color.txt and colour.txt and you want to match both:

find dir/ -regex '.*/colou?r\.txt'

This pattern matches both color.txt and colour.txt, accounting for the optional “u”.

 

Using + Quantifier

Suppose you have a folder with log files. Some logs have a date format, and you want to pick out files that specifically have numbers.

find logs/ -regex '.*/log[0-9]+\.txt'

This will match log1.txt and log20230819.txt, but exclude log.txt. The digits are written as [0-9] because find does not recognize the \d shorthand.
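
The following sketch reproduces that case in a temporary directory. It assumes a find whose default syntax treats a bare + as a quantifier, as GNU find does; the file names are made up.

```shell
# Hypothetical log files for the demonstration.
demo=$(mktemp -d)
touch "$demo/log.txt" "$demo/log1.txt" "$demo/log20230819.txt"

# [0-9]+ requires at least one digit, so log.txt is left out.
numbered=$(cd "$demo" && find . -type f -regex '.*/log[0-9]+\.txt' | LC_ALL=C sort)

printf '%s\n' "$numbered"
rm -rf "$demo"
```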

 

Utilizing the wildcard (*)

The * character is often loosely called a “wildcard”, but in regex it is a quantifier: it matches zero or more occurrences of the preceding character or group.

The * character is useful when the number of characters you’re trying to match is unknown.

Imagine a directory containing the following files:

docs/
    |
    |-- product.txt
    |-- production.txt
    |-- producer.txt
    |-- produce.txt

To match files starting with “produc” followed by any number of characters:

find docs/ -regex '.*/produc.*\.txt'

This would match all four files: product.txt, production.txt, producer.txt, and produce.txt. The .* consumes whatever follows “produc”, and every name ends in .txt.

Because * also matches zero occurrences, it effectively makes the preceding element optional.

Consider files named data.txt, data1.txt, data12.txt, and so forth:

find /path/ -regex '.*/data[0-9]*\.txt'

This would match data.txt (with zero occurrences of [0-9]), data1.txt, data12.txt, and any other file starting with data followed by zero or more numbers.

Remember, the * wildcard in regex isn’t the same as the * wildcard in shell globbing.

In the shell, * matches any sequence of characters, but in regex, it specifies the quantity of the preceding character or group.
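
The difference shows up when a glob and a regex that look similar are run side by side. This is a sketch on hypothetical files; database.txt is included because only the glob accepts it.

```shell
# Hypothetical files for the comparison.
demo=$(mktemp -d)
touch "$demo/data.txt" "$demo/data1.txt" "$demo/data12.txt" "$demo/database.txt"

# Glob: * stands for any run of characters, so "base" is accepted.
glob_hits=$(cd "$demo" && find . -type f -name 'data*.txt' | LC_ALL=C sort)

# Regex: * repeats the preceding [0-9], so only digits may follow "data".
regex_hits=$(cd "$demo" && find . -type f -regex '.*/data[0-9]*\.txt' | LC_ALL=C sort)

printf 'glob:\n%s\nregex:\n%s\n' "$glob_hits" "$regex_hits"
rm -rf "$demo"
```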

 

Defining custom character classes

Character classes allow you to define a specific set of characters to match.

  • [...]: Matches any one of the characters enclosed in the square brackets.
  • [^...]: Matches any character NOT enclosed in the square brackets.

find /path -regex '.*/file[123].*'

This command searches for files whose names start with “file1”, “file2”, or “file3”.

On the other hand:

find /path -regex '.*/file[^123].*'

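This will search for names that start with “file” followed by at least one character other than 1, 2, or 3. A file named just “file” does not match, because [^123] still has to consume one character.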

Example 1: Finding files starting with numbers

When you want to find files or directories starting with numbers, you can use the [0-9] character class.

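find /path -regex '.*/[0-9][^/]*'

This command targets files or directories within the specified path whose own name starts with any digit from 0 to 9. The [^/]* keeps the rest of the match inside the last path component, so files that merely sit below a directory starting with a digit are not included.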

Example 2: Excluding files starting with vowels

If you want to exclude files or directories that start with vowels, you can use the [^...] notation to negate a character class.

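find /path -regex '.*/[^aeiouAEIOU/][^/]*'

This command finds files or directories that do not start with a vowel, considering both lowercase and uppercase vowels. The / inside the brackets and the trailing [^/]* tie the test to the last path component. With a looser pattern such as '.*/[^aeiouAEIOU].*', a file like apple.txt would still match whenever any directory above it starts with a consonant.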
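
Because the regex sees the whole path, a negated class can be satisfied by a parent directory instead of the file name. The sketch below contrasts a loose pattern with one restricted to the last path component; the directory and file names are hypothetical.

```shell
# Hypothetical tree: a directory starting with a consonant holds both files.
demo=$(mktemp -d)
mkdir "$demo/box"
touch "$demo/box/apple.txt" "$demo/box/banana.txt"

# Loose: [^aeiouAEIOU] can match the "b" of "box", so apple.txt slips through.
loose=$(cd "$demo" && find . -type f -regex '.*/[^aeiouAEIOU].*' | LC_ALL=C sort)

# Strict: the negated class must be the first character of the last component.
strict=$(cd "$demo" && find . -type f -regex '.*/[^aeiouAEIOU/][^/]*' | LC_ALL=C sort)

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$demo"
```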

 

Recognizing shorthands

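Perl-style regular expressions offer shorthand character classes for common patterns:

  • \d: Matches any digit (0-9). Equivalent to [0-9].
  • \w: Matches any word character (alphanumeric characters plus underscore). Equivalent to [a-zA-Z0-9_].
  • \s: Matches any whitespace character (spaces, tabs, etc.).

Their uppercase forms represent the negation:

  • \D: Matches any non-digit.
  • \W: Matches any non-word character.
  • \S: Matches any non-whitespace character.

The find command does not use a Perl-compatible engine, so these shorthands are not portable. GNU find does not recognize \d, while \w and \W are available only as GNU extensions. Bracket expressions such as [0-9] and [a-zA-Z0-9_] work everywhere, and POSIX classes such as [[:space:]] work in the POSIX syntaxes.

find /path -regex '.*/[0-9][^/]*'

This command will find files or directories starting with a digit.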

Example 1: Finding files with digits in their names

To locate files or directories that contain at least one digit in their name, use the [0-9] bracket expression in place of the \d shorthand.

find /path -regex '.*/[^/]*[0-9][^/]*'

This command searches within the given path for files or directories containing any digit from 0 to 9 in their name.

Example 2: Locating files with word characters

To search for files or directories containing word characters, spell out the class that \w stands for: any alphanumeric character or underscore.

find /path -regex '.*/[^/]*[a-zA-Z0-9_][^/]*'

The command looks for files or directories within the specified path that have at least one word character in their name.

Example 3: Identifying files with space in their names

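Files are often saved with spaces in their names, and these can be problematic in scripts or automated processes, so it helps to be able to locate them.

find media/ -regextype posix-extended -regex '.*/[^/]*[[:space:]][^/]*'

This command will get files like summer photos.jpg, project plan.docx, but will exclude names like data_summary.xlsx. The -regextype option is specific to GNU find; for plain spaces, a literal space in the pattern works as well.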
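
As a rough check of the portable replacements for \d and \s, the sketch below searches a few made-up files with a bracket expression and a POSIX class. The [[:space:]] search assumes GNU find, since it relies on -regextype.

```shell
# Hypothetical files: one with a space, one with digits, one with neither.
demo=$(mktemp -d)
touch "$demo/summer photos.jpg" "$demo/data_summary.xlsx" "$demo/report2023.txt"

# [0-9] in place of \d: names containing at least one digit.
with_digit=$(cd "$demo" && find . -type f -regex '.*/[^/]*[0-9][^/]*')

# [[:space:]] in place of \s, using the POSIX ERE syntax (GNU find).
with_space=$(cd "$demo" && find . -regextype posix-extended -type f -regex '.*/[^/]*[[:space:]][^/]*')

printf 'digit: %s\nspace: %s\n' "$with_digit" "$with_space"
rm -rf "$demo"
```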

 

Repetition quantifiers

Quantifiers determine how many times the preceding element should match:

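  • {n}: Matches the previous element exactly n times.
  • {n,}: Matches the previous element at least n times.
  • {n,m}: Matches the previous element between n and m times, inclusive.

find /path -regextype posix-extended -regex '.*file[0-9]{3}.*'

This command searches for files or directories whose path contains “file” followed by at least three digits; the trailing .* allows further characters, including more digits, to follow. The bare { } form is ERE syntax, which is why the command selects posix-extended (a GNU find option).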

 

Finding files with repeated patterns

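To identify files with repeated patterns using {n,m} quantifiers, you group the pattern and specify how many times in a row it should appear.

find /path -regextype posix-extended -regex '.*(pattern){n,m}.*'

Here n and m are placeholders for numbers. This command will search for files or directories where the word “pattern” appears at least n times in a row in their path. Because the surrounding .* can absorb extra repetitions, the upper bound m only has an effect when the rest of the regex is stricter.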

Let’s say you want to find configuration files that have a pattern where there’s a repeated occurrence of digits, like IP addresses or version numbers.

find /path -regextype posix-extended -regex '.*([0-9]{1,3}\.){3}[0-9]{1,3}.*'

This command searches for filenames resembling IP address patterns, like “192.168.1.1” or “10.0.0.1”, where each number can be one to three digits long and is separated by periods.
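
Here is a small sketch of the IP-like search on made-up file names. It assumes GNU find, because it uses -regextype posix-extended so that ( ) and { } can be written without backslashes.

```shell
# Hypothetical file names: two contain four dotted numbers, one has only two.
demo=$(mktemp -d)
touch "$demo/host-192.168.1.1.conf" "$demo/host-10.0.0.1.conf" "$demo/v1.2.conf"

# Three "digits plus dot" groups followed by a final run of digits.
ip_like=$(cd "$demo" && find . -regextype posix-extended -type f \
  -regex '.*/[^/]*([0-9]{1,3}\.){3}[0-9]{1,3}[^/]*' | LC_ALL=C sort)

printf '%s\n' "$ip_like"
rm -rf "$demo"
```

The version-style name v1.2.conf is skipped because it never provides three consecutive number-and-dot groups.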

 

Case-insensitive searching

At times, you might want to perform a search that doesn’t differentiate between uppercase and lowercase letters. The -iregex option allows for case-insensitive matching.

To find files named “config” regardless of case:

find /path -iregex '.*/[^/]*config[^/]*'

This command locates files with names containing “config”, “Config”, “CONFIG”, or any other case variation of the word.
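
The sketch below tries -iregex on a few hypothetical names that differ only in letter case, with the pattern limited to the last path component.

```shell
# Hypothetical files: three case variants of "config" and one unrelated file.
demo=$(mktemp -d)
touch "$demo/config.yml" "$demo/Config.bak" "$demo/CONFIG" "$demo/notes.txt"

# -iregex ignores case, so every variant of "config" is accepted.
any_case=$(cd "$demo" && find . -type f -iregex '.*/[^/]*config[^/]*' | LC_ALL=C sort)

printf '%s\n' "$any_case"
rm -rf "$demo"
```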

 

Mixing multiple patterns using |

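When you want to search for files that match any of several patterns, you can use alternation. In extended regular expressions the operator is | with ( ) for grouping. In GNU find's default syntax, which the examples in this section use, it is written \| with \( \).

Example 1: Finding files matching one of several extensions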

You have a directory with different types of media files, and you want to pick out all the image files with extensions jpg, png, or gif.

find media/ -regex '.*\.\(jpg\|png\|gif\)'

This command matches files such as photo.jpg, icon.png, and animation.gif.

Example 2: Locating files by multiple naming conventions

In a configuration directory, there might be two naming conventions – files ending with -config.txt or -configuration.txt.

find config/ -regex '.*-\(config\|configuration\)\.txt'

This will match both server-config.txt and database-configuration.txt.

Example 3: Searching for backup or temporary files

Backup (ending in .bak) or temporary files (ending in ~) may accumulate. You want to identify them all in a directory.

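find projects/ -regex '.*\(\.bak\|~\)'

This command matches files like code.bak or document.txt~. The dot sits inside the group because only the .bak branch needs it; a name ending in ~ has no dot directly before the tilde.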
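
A minimal sketch of the backup and temporary file search, run on hypothetical files. It uses the backslashed \( \| \) form, so it assumes GNU find's default syntax.

```shell
# Hypothetical project files: one backup, one editor temp file, one regular file.
demo=$(mktemp -d)
touch "$demo/code.bak" "$demo/document.txt~" "$demo/notes.txt"

# Either a ".bak" suffix or a trailing "~".
leftovers=$(cd "$demo" && find . -type f -regex '.*\(\.bak\|~\)' | LC_ALL=C sort)

printf '%s\n' "$leftovers"
rm -rf "$demo"
```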
