
Regex tutorial for Linux

To work successfully with the Linux sed editor and the awk command in your shell scripts, you have to understand regular expressions, or regex for short. To be accurate, in our case it is bash regex: there are many regex engines you can use, and in this tutorial we will use the shell regex and see the power of bash in working with regex.

First, we need to understand what regex is; then we will dive deep into using it.

Our main points are:

What is regex

Types of regex

Define BRE Patterns

Special characters

Anchor characters

The dot character

Character classes

Negating character classes

Using ranges

Special character classes

The asterisk

Extended Regular Expressions

Grouping expressions

Practical examples

 

What is regex

When some people see regular expressions for the first time, they ask what this ASCII puke is! Well, a regular expression, or regex, is a pattern of text you define that a Linux program (in our case sed or awk) uses to filter text.

The regex pattern makes use of wildcard characters to represent one or more characters in the data stream. We’ve seen some wildcard characters when introducing basic Linux commands, where the ls command uses them to filter its output.

Types of regex

Many different applications use different types of regex in Linux. These include programming languages (Java, Perl, Python, ...), Linux programs (sed, awk, grep, ...), and many other applications.

A regex is implemented using a regular expression engine. A regular expression engine is the underlying software that interprets regular expression patterns and uses those patterns to match the text.

Linux has two regular expression engines:

  • The POSIX Basic Regular Expression (BRE) engine
  • The POSIX Extended Regular Expression (ERE) engine

Most Linux programs at a minimum conform to the POSIX BRE engine specifications, recognizing all the pattern symbols it defines. Unfortunately, some utilities (such as sed) conform only to a subset of the BRE engine specifications. This is due to speed constraints, because sed attempts to process text as quickly as possible.

The POSIX ERE engine is often found in programming languages. It provides advanced pattern symbols as well as special symbols for common patterns, such as matching digits and words. The awk command uses the ERE engine to process its regular expression patterns.

Because there are so many different ways to implement regex, it’s hard to write patterns that work on all engines. Hence we will focus on the most commonly found regex symbols and demonstrate how to use them in sed and awk.
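To see why the engine matters, here is a small sketch that is not part of the original examples. It uses grep, which reads patterns as BRE by default and as ERE with the -E option. The sample input and the variable names are made up for this illustration.

```shell
# Two sample lines: one with repeated letters, one containing a literal plus sign
sample=$(printf 'aaa\na+\n')

# BRE (grep default): the plus sign is an ordinary character, so only "a+" matches
bre_result=$(printf '%s\n' "$sample" | grep 'a+')

# ERE (grep -E): the plus sign means "one or more", so both lines match
ere_result=$(printf '%s\n' "$sample" | grep -E 'a+')

echo "BRE matched: $bre_result"
echo "ERE matched: $ere_result"
```

The same pattern text gives different results depending on which engine reads it, which is why it pays to know what your tool expects.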

Define BRE Patterns

The most basic BRE pattern matches literal text characters in a data stream. We’ve seen that using sed and awk, but let’s refresh our memory:

$ echo "This is a test" | sed -n '/test/p'

$ echo "This is a test" | awk '/test/{print $0}'

You may notice that the regex doesn’t care where in the data stream the pattern occurs. It also doesn’t matter how many times the pattern occurs. As soon as the regex matches the pattern anywhere in the text string, it passes the string along to the Linux program that’s using it.

The first rule to remember is that regular expression patterns are case sensitive.

$ echo "This is a test" | awk '/Test/{print $0}'

$ echo "This is a test" | awk '/test/{print $0}'

The first regex found no match because the capitalized word “Test” doesn’t appear in the text string, while the second one, which uses the lowercase letter in the pattern, worked just fine.
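If you do want a match regardless of case, one common approach (shown here as a sketch, not something the examples above rely on) is to lowercase the line with awk's built-in tolower function before matching. The sample text and the variable name ci_result are made up for illustration.

```shell
# Lowercase the whole line before matching, so "TEST", "Test" and "test" all match.
# The original line ($0) is printed unchanged.
ci_result=$(echo "This is a TEST" | awk 'tolower($0) ~ /test/ {print $0}')

echo "$ci_result"
```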

You also don’t have to limit yourself to single text words in the regular expression. You can include spaces and numbers in your pattern as well:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'


Spaces are treated just like any other character in regex.

Special characters

There are a few exceptions when defining text characters in a regex.

Regex patterns assign a special meaning to a few characters. If you try to use these characters as plain text in your pattern, you won’t get the results you were expecting.

These special characters are recognized by regex:

.*[]^${}\+?|()

If you want to use one of the special characters as a text character, you need to escape it.

The special character that does this is the backslash character (\).

For example, if you want to search for a dollar sign in your text, just precede it with a backslash character like this:

$ awk '/\$/{print $0}' myfile

The backslash itself is also a special character. If you need to use it in a regex pattern, you need to escape it as well, producing a double backslash:

$ echo "\ is a special character" | awk '/\\/{print $0}'

Although the forward slash isn’t a regular expression special character, sed and awk use it to delimit the pattern, so if you use it unescaped inside your pattern, you still get an error:

$ echo "3 / 2" | awk '///{print $0}'

So you need to escape it like this:

$ echo "3 / 2" | awk '/\//{print $0}'
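Escaping is not the only way around the slash problem. As a sketch of two alternatives: awk can take the pattern as a string on the right side of the ~ operator, and sed lets you pick another delimiter by starting the address with a backslash. The variable names below are illustrative only.

```shell
# awk: a string used as a dynamic regex, so the slash needs no escaping
awk_result=$(echo "3 / 2" | awk '$0 ~ "/" {print $0}')

# sed: \% starts an address that uses % instead of / as the pattern delimiter
sed_result=$(echo "3 / 2" | sed -n '\%/%p')

echo "$awk_result"
echo "$sed_result"
```

This is mostly useful when the pattern contains many slashes, such as file paths, where a row of backslashes becomes hard to read.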


Anchor characters

You can use two special characters to anchor a pattern to either the beginning or the end of lines in the text.

The caret character (^) anchors a pattern to the beginning of a line of text.

If the pattern is located any place other than the start of the line, the regex pattern fails.

You can use it like this:

$ echo "welcome to likegeeks website" | awk '/^likegeeks/{print $0}'

$ echo "likegeeks website" | awk '/^likegeeks/{print $0}'

The caret anchor character checks for the pattern at the beginning of each new line of data:

$ awk '/^this/{print $0}' myfile

Great!! When using sed, if you position the caret character anywhere other than at the beginning of the pattern, it acts like a normal character and not as a special character:

$ echo "This ^ is a test" | sed -n '/s ^/p'

But if you use awk, you have to escape it like this:

$ echo "This ^ is a test" | awk '/s \^/{print $0}'

That covers matching at the beginning of the text. What about matching at the end?

The dollar sign ($) special character defines the end anchor:

$ echo "This is a test" | awk '/test$/{print $0}'

You can combine both the start and end anchors in the same pattern like this:

$ awk '/^this is a test$/{print $0}' myfile

As you can see, it prints only the line that consists of exactly the matching pattern.

You can filter out blank lines with the following pattern:

$ awk '!/^$/{print $0}' myfile

Here we introduce negation, which is done with the exclamation mark (!).

The pattern looks for lines that have nothing between the start and end of the line, and the negation prints only the lines that have text.
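One thing to watch for: a line that holds only spaces or tabs is not empty, so ^$ does not catch it. The sketch below compares the strict pattern with a looser one that allows any number of spaces or tabs between the anchors (it uses a character class and the asterisk, both covered later in this tutorial). The sample input and variable names are made up for illustration.

```shell
# Sample input: text, an empty line, a line with three spaces, then text again
sample=$(printf 'first\n\n   \nlast\n')

# Strict: only truly empty lines are dropped, the spaces-only line survives
strict=$(printf '%s\n' "$sample" | awk '!/^$/{print $0}')

# Loose: any run of spaces or tabs between the anchors also counts as blank
loose=$(printf '%s\n' "$sample" | awk '!/^[ \t]*$/{print $0}')

echo "strict kept $(printf '%s\n' "$strict" | wc -l | tr -d ' ') lines"
echo "loose kept $(printf '%s\n' "$loose" | wc -l | tr -d ' ') lines"
```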

The dot character

The dot character matches any single character except a newline character.

Look at the following example to get the idea:

$ awk '/.st/{print $0}' myfile

You can see from the result that it prints only the first two lines, because they contain the st pattern preceded by another character. The third line does not contain st at all, and the fourth line starts with st, so there is no character before it for the dot to match.
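The contents of myfile are not shown here, so the following sketch uses four made-up lines that fit the description above. Treat them as a hypothetical stand-in for the file, not as the original data.

```shell
# Hypothetical stand-in for myfile (the real file contents are an assumption)
sample_lines=$(printf '%s\n' \
    'this is a test' \
    'This is another test' \
    'And this is one more' \
    'start with this')

# The dot needs one character in front of "st", so the last line does not match
dot_result=$(printf '%s\n' "$sample_lines" | awk '/.st/{print $0}')

echo "$dot_result"
```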

Character classes

You can match any character with the dot special character, but what if you want to limit which characters can match? For that you use a character class.

You can define a set of characters that would match a position in a text pattern. If one of the characters from the character set is in the text at that position, it matches the pattern.

To define a character class, you use square brackets [] like this:

$ awk '/[oi]th/{print $0}' myfile

Here we search for any th that has an o or an i before it.

This comes in handy when you are searching for words that may start with an uppercase or a lowercase letter and you are not sure which.

$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'

$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'

Of course, character classes are not limited to letters; you can use digits or any other characters you want.

Negating character classes

You can also reverse the effect of a character class. Instead of looking for a character contained in the class, you can look for any character that’s not in the class. To do that, just place a caret character at the beginning of the character class.

$ awk '/[^oi]th/{print $0}' myfile

By negating the character class, the regex pattern matches th preceded by any character that’s neither an o nor an i.
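Here is a self-contained sketch that runs a character class and its negation over the same input. The word list and variable names are made up for illustration. Note that a negated class still has to match some character, so a line that begins with th matches neither pattern.

```shell
# Made-up word list, one word per line
words=$(printf '%s\n' tooth with fifth earth this)

# th preceded by o or i
in_class=$(printf '%s\n' "$words" | awk '/[oi]th/{print $0}')

# th preceded by any character other than o or i
not_in_class=$(printf '%s\n' "$words" | awk '/[^oi]th/{print $0}')

echo "in class:     $(echo $in_class)"
echo "not in class: $(echo $not_in_class)"
```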

Using ranges

You can use a range of characters within a character class by using the dash symbol like this.

$ awk '/[e-p]st/{print $0}' myfile

This matches any character from e through p followed by st.

You can also use ranges for numbers:

$ echo "123" | awk '/[0-9][0-9][0-9]/'

$ echo "12a" | awk '/[0-9][0-9][0-9]/'

You can also specify multiple, non-continuous ranges in a single character class:

$ awk '/[a-fm-z]st/{print $0}' myfile

The character class allows any character in the ranges a through f and m through z to appear before the st text.

Special character classes

The BRE contains special character classes you can use to match against specific types of characters

And this is the list

[[:alpha:]]    Matches any alphabetical character, either upper or lower case
[[:alnum:]]    Matches any alphanumeric character: 0–9, A–Z, or a–z
[[:blank:]]    Matches a space or Tab character
[[:digit:]]    Matches a numerical digit from 0 through 9
[[:lower:]]    Matches any lowercase alphabetical character a–z
[[:print:]]    Matches any printable character
[[:punct:]]    Matches a punctuation character
[[:space:]]    Matches any whitespace character: space, Tab, NL, FF, VT, CR
[[:upper:]]    Matches any uppercase alphabetical character A–Z

You can use them like this

$ echo "abc" | awk '/[[:alpha:]]/{print $0}'

$ echo "abc" | awk '/[[:digit:]]/{print $0}'

$ echo "abc123" | awk '/[[:digit:]]/{print $0}'
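These classes combine naturally with the anchors. The sketch below uses grep, which reads patterns as BRE by default; the same patterns should behave the same way in awk. The sample entries and variable names are made up for illustration.

```shell
# Made-up sample lines
entries=$(printf '%s\n' 'Alpha 1' 'beta' 'Gamma' 'delta 42')

# Lines that begin with an uppercase letter
starts_upper=$(printf '%s\n' "$entries" | grep '^[[:upper:]]')

# Lines that end with a digit
ends_digit=$(printf '%s\n' "$entries" | grep '[[:digit:]]$')

echo "$starts_upper"
echo "$ends_digit"
```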


The asterisk

Placing an asterisk after a character signifies that the character may appear zero or more times in the text to match the pattern:

$ echo "test" | awk '/tes*t/{print $0}'

$ echo "tessst" | awk '/tes*t/{print $0}'

This pattern symbol is commonly used for handling words that have a common misspelling or variations in language spellings:

$ echo "I like green color" | awk '/colou*r/{print $0}'

$ echo "I like green colour " | awk '/colou*r/{print $0}'

In these examples, both color and colour match, because the asterisk means the u character may appear zero times or many times.

Another handy feature is combining the dot character with the asterisk character. This combination provides a pattern to match any number of any characters.

$ awk '/this.*test/{print $0}' myfile

It doesn’t matter how many words appear between the words this and test; any line that matches will be printed.

The asterisk can also be applied to a character class.

$ echo "st" | awk '/s[ae]*t/{print $0}'

$ echo "sat" | awk '/s[ae]*t/{print $0}'

$ echo "set" | awk '/s[ae]*t/{print $0}'

All three examples match because the asterisk allows zero or more occurrences of a or e between the s and the t.
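Because zero occurrences count as a match, a pattern made of nothing but a starred character matches every line. The sketch below shows this pitfall and one way around it, which is to write the character once before the starred copy. The sample lines and variable names are made up for illustration.

```shell
# Two sample lines, only one of which contains an x
lines=$(printf '%s\n' 'abc' 'xyz')

# x* can match the empty string, so every line matches, even "abc"
star_count=$(printf '%s\n' "$lines" | awk '/x*/{print $0}' | wc -l | tr -d ' ')

# xx* requires at least one x, so only "xyz" matches
strict_result=$(printf '%s\n' "$lines" | awk '/xx*/{print $0}')

echo "x* matched $star_count lines"
echo "xx* matched: $strict_result"
```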

Extended Regular Expressions

The POSIX ERE patterns include a few additional symbols that are used by some Linux applications and utilities. The awk command recognizes the ERE patterns, but sed doesn’t by default (GNU sed accepts them when you pass the -E option).

We will discuss the commonly used ERE pattern symbols that you can use in your awk program scripts.
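Before moving on, here is a sketch of how the same pattern behaves in sed with and without ERE support. It assumes a sed that accepts the -E option, as GNU sed does; the pattern and variable names are made up for illustration.

```shell
# Default BRE mode: ? and | are ordinary characters here, so nothing matches
bre_out=$(echo "This is an exam" | sed -n '/tests?|exam/p')

# With -E the same pattern is read as ERE: an optional s, and alternation
ere_out=$(echo "This is an exam" | sed -n -E '/tests?|exam/p')

echo "BRE output: [$bre_out]"
echo "ERE output: [$ere_out]"
```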

The question mark

The question mark indicates that the preceding character can appear zero times or once, but no more than that:

$ echo "tet" | awk '/tes?t/{print $0}'

$ echo "test" | awk '/tes?t/{print $0}'

$ echo "tesst" | awk '/tes?t/{print $0}'

You can use the question mark symbol along with a character class:

$ echo "tst" | awk '/t[ae]?st/{print $0}'

$ echo "test" | awk '/t[ae]?st/{print $0}'

$ echo "tast" | awk '/t[ae]?st/{print $0}'

$ echo "taest" | awk '/t[ae]?st/{print $0}'

$ echo "teest" | awk '/t[ae]?st/{print $0}'

If zero or one character from the character class appears, the pattern match passes. But if both characters appear, or if one of the characters appears twice, the pattern match fails.

The plus sign

The plus sign indicates that the preceding character can appear one or more times, but must be present at least once

$ echo "test" | awk '/te+st/{print $0}'

$ echo "teest" | awk '/te+st/{print $0}'

$ echo "tst" | awk '/te+st/{print $0}'

If the e character is not present, the pattern match fails. The plus sign also works with character classes, the same way as the asterisk and question mark:

$ echo "tst" | awk '/t[ae]+st/{print $0}'

$ echo "test" | awk '/t[ae]+st/{print $0}'

$ echo "teast" | awk '/t[ae]+st/{print $0}'

$ echo "teeast" | awk '/t[ae]+st/{print $0}'

This time, the text matches as long as one or more characters from the character class appear in that position, in any combination.

Curly braces

Curly braces are available in ERE to let you specify a limit on a repeatable regex. This interval has two formats:

{n}: The regex appears exactly n times.

{n,m}: The regex appears at least n times, but no more than m times.

$ echo "tst" | awk '/te{1}st/{print $0}'

$ echo "test" | awk '/te{1}st/{print $0}'

In old versions of gawk, you had to pass the --re-interval command line option for awk to recognize regular expression intervals, but current versions no longer need it.

$ echo "tst" | awk '/te{1,2}st/{print $0}'

$ echo "test" | awk '/te{1,2}st/{print $0}'

$ echo "teest" | awk '/te{1,2}st/{print $0}'

$ echo "teeest" | awk '/te{1,2}st/{print $0}'

In this example, the e character can appear once or twice for the pattern match to pass; otherwise, the pattern match fails.

The interval pattern also applies to character classes in the same way:

$ echo "tst" | awk  '/t[ae]{1,2}st/{print $0}'

$ echo "test" | awk  '/t[ae]{1,2}st/{print $0}'

$ echo "teest" | awk  '/t[ae]{1,2}st/{print $0}'

$ echo "teeast" | awk  '/t[ae]{1,2}st/{print $0}'

This regex pattern matches if there are one or two instances of the letters a or e in that position, but it fails if there are any more, in any combination.
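Intervals are most useful together with anchors, for example to check that a line holds exactly five digits. The sketch below uses grep -E, which reads the same ERE syntax, because not every awk implementation supports intervals. The sample codes and variable names are made up for illustration.

```shell
# Made-up sample values, one per line
codes=$(printf '%s\n' '1234' '12345' '123456')

# Anchored: the whole line must be exactly five digits
five_digits=$(printf '%s\n' "$codes" | grep -E '^[0-9]{5}$')

# Unanchored: any line containing five digits in a row matches,
# so the six-digit line is counted as well
unanchored_count=$(printf '%s\n' "$codes" | grep -Ec '[0-9]{5}')

echo "exactly five digits: $five_digits"
echo "lines matched without anchors: $unanchored_count"
```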

The pipe symbol

The pipe symbol allows you to specify two or more patterns that the regex engine combines with a logical OR when examining the data stream. If any of the patterns matches the text, the text passes. If none of the patterns matches, the match fails. Here is an example:

$ echo "This is a test" | awk '/test|exam/{print $0}'

$ echo "This is an exam" | awk '/test|exam/{print $0}'

$ echo "This is something else" | awk '/test|exam/{print $0}'

This example looks for the regular expression test or exam in the text. Keep in mind that you can’t place any spaces between the regular expressions and the pipe symbol, because a space would become part of the pattern.
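To make the point about spaces concrete, here is a small sketch. With spaces around the pipe, the alternatives become "test " and " exam", each including a space, so a line holding only the word exam no longer matches. The variable names are illustrative.

```shell
# No spaces around the pipe: "exam" matches the second alternative
tight=$(echo "exam" | awk '/test|exam/{print $0}')

# Spaces around the pipe are part of the alternatives, so nothing matches
spaced=$(echo "exam" | awk '/test | exam/{print $0}')

echo "without spaces: [$tight]"
echo "with spaces:    [$spaced]"
```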

Grouping expressions

Regex patterns can also be grouped by using parentheses. When you group a regex pattern, the group is treated like a standard character. You can apply a special character to the group just as you would to a regular character

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'

$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'

Grouping the “Geeks” ending and following it with the question mark allows the pattern to match either the full name LikeGeeks or the word Like alone.
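Without anchors, though, this pattern matches any line that contains Like, since the group is optional. The sketch below adds the start and end anchors so that only the two intended words pass. The name list and variable names are made up for illustration.

```shell
# Made-up list of candidate names
names=$(printf '%s\n' 'Like' 'LikeGeeks' 'LikeGeek' 'Likes')

# Unanchored: every line contains "Like", so all four match
loose_count=$(printf '%s\n' "$names" | awk '/Like(Geeks)?/{print $0}' | wc -l | tr -d ' ')

# Anchored: the whole line must be "Like" or "LikeGeeks"
exact=$(printf '%s\n' "$names" | awk '/^Like(Geeks)?$/{print $0}')

echo "unanchored matches: $loose_count"
echo "anchored matches:   $(echo $exact)"
```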

Practical examples

We’ve seen some simple demonstrations of regular expression patterns. It’s time to put them into action, just for practice; as I always say, more practice means more power.

Counting directory files

Let’s look at a bash script that counts the executable files that are present in the directories defined in your PATH environment variable. To do that, you need to parse out the PATH variable into separate directory names.

$ echo $PATH

To get a listing of directories that you can use in a script, you must replace each colon with a space.

$ echo $PATH | sed 's/:/ /g'

Now let’s iterate through each directory using a for loop.
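The loop itself is not shown here, so the following is only a sketch of what it could look like. The variable names mypath and directory are assumptions, and the loop relies on the shell splitting the list on spaces, so a PATH entry that contains a space would be split incorrectly.

```shell
#!/bin/bash
# Replace each colon in PATH with a space to get a list of directory names
mypath=$(echo "$PATH" | sed 's/:/ /g')

# Visit each directory in turn; $mypath is left unquoted on purpose
# so that the shell splits it into separate words
for directory in $mypath; do
    echo "$directory"
done
```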

Great!!

Now we can use the ls command to list each file in each directory and save the count in a variable

You may notice that some directories do not exist; that is not a problem.
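Putting the steps together, a minimal sketch of the counting script might look like this. It is a reconstruction under assumptions, not the original script: the variable names are made up, it counts every entry that ls lists rather than checking the executable bit, and it silently skips directories that do not exist.

```shell
#!/bin/bash
# Replace each colon in PATH with a space to get a list of directory names
mypath=$(echo "$PATH" | sed 's/:/ /g')
total=0

for directory in $mypath; do
    # ls prints one entry per line when piped; errors for missing
    # directories are discarded, which leaves a count of 0
    count=$(ls "$directory" 2>/dev/null | wc -l | tr -d ' ')
    echo "$directory - $count"
    total=$((total + count))
done

echo "total - $total"
```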

Cool!! This is the power of regex. These few lines of code count all the files in all the directories. Of course, there is a Linux command that does this very easily, but the point here is to show how to employ regex in something you can use, and with a few ideas of your own you can come up with something more useful.

Validating e-mail address

There are a ton of websites that offer ready-to-use regex patterns for everything: e-mail addresses, phone numbers, and much more. This is handy, but we want to understand how they work.

username@hostname.com

The username can use any alphanumeric characters combined with the dot, dash, plus sign, and underscore.

The hostname can use any alphanumeric characters combined with the dot, dash, and underscore.

Let’s start building the regular expression pattern from the left side. We know that there can be multiple valid characters in the username. This should be very easy:

^([a-zA-Z0-9_\-\.\+]+)@

This grouping specifies the allowed characters in the username, and the plus sign indicates that at least one of them must be present, followed by the @ sign.

Then the hostname pattern should be like this:

([a-zA-Z0-9_\-\.]+)

There are special rules for the top-level domain. In this example we assume that top-level domains contain only alphabetic characters, and that they must be no fewer than two characters (used in country codes) and no more than five characters in length. The following is the regex pattern for the top-level domain:

\.([a-zA-Z]{2,5})$

Now we put them all together

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Let’s test that regex against an email

$ echo "name@host.com" | awk '/^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$/{print $0}'

$ echo "name@host.com.us" | awk '/^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$/{print $0}'


Awesome!! Works great
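To reuse the check in a script, you could wrap it in a small function. The sketch below is one possible approach under a few assumptions: the function name is_valid_email is made up, it uses grep -E instead of awk, and the character classes are written without backslashes, with the dash placed last so it is taken literally. It follows the same simplified rules as above, so it is not a complete e-mail validator.

```shell
# Returns success (0) when the argument looks like an e-mail address
# under the simplified rules used in this tutorial
is_valid_email() {
    printf '%s\n' "$1" |
        grep -Eq '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}$'
}

# Try a few made-up addresses
for address in "name@host.com" "name@host.com.us" "name@host" "@host.com"; do
    if is_valid_email "$address"; then
        echo "$address: valid"
    else
        echo "$address: invalid"
    fi
done
```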

This was just the beginning of the regex world, which never ends. I hope that after this post you understand these ASCII pukes :) and can use regex more professionally.

That’s all for now. I hope you like the post.

Thank you.