Regex tutorial for Linux

To work effectively with the Linux sed editor and the awk command in your shell scripts, you have to understand regular expressions, or regex for short. Since there are many regex engines, we will stick to the ones used by the shell tools and see how powerful they are in bash scripts.

First, we need to understand what regex is, then we will see how to use it.

 

What is regex

When some people see regular expressions for the first time, they ask: what are these ASCII pukes?!

A regular expression, or regex, is a text pattern you define that a Linux program like sed or awk uses to filter text.

A regex pattern uses special characters, similar in spirit to wildcards, to represent one or more characters in the data stream. We saw wildcard characters when introducing basic Linux commands, where the ls command used them to filter output. Those shell wildcards are related to regex but follow different rules.

Types of regex

Many different applications on Linux use different types of regex, including programming languages (Java, Perl, Python, ...) and Linux programs (sed, awk, grep, ...).

A regex is implemented by a regular expression engine: the underlying software that interprets regular expression patterns and uses them to match text.

Linux has two regular expression engines:

  • The Basic Regular Expression (BRE) engine.
  • The Extended Regular Expression (ERE) engine.

Most Linux programs work well with the BRE engine specifications, but some tools, like sed, understand only a subset of the BRE engine rules.

The POSIX ERE engine is shipped with some programming languages. It provides more pattern symbols for common cases, such as matching digits and words. The awk command uses the ERE engine to process its regular expression patterns.

Because there are so many different ways to implement regex, it’s hard to write patterns that work on all engines. Hence, we will focus on the most commonly supported regex patterns and demonstrate how to use them in sed and awk.
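
To make the difference between the two engines concrete, here is a minimal sketch using grep, which can run either one. The sample input and the variable names bre_count and ere_count are made up for this illustration. The same "exactly two a characters" interval is written with escaped braces in BRE and with plain braces in ERE.

```shell
# BRE (default grep): interval braces must be escaped with backslashes.
bre_count=$(printf 'ab\naab\naaab\n' | grep -c 'a\{2\}b')

# ERE (grep -E): the same interval is written without backslashes.
ere_count=$(printf 'ab\naab\naaab\n' | grep -E -c 'a{2}b')

echo "BRE matched $bre_count lines, ERE matched $ere_count lines"
```

Both commands count the same lines; only the notation differs.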

Define BRE Patterns

You can define a pattern to match text like this:

$ echo "This is a test" | sed -n '/test/p'

$ echo "This is a test" | awk '/test/{print $0}'

You may notice that the regex doesn’t care where the pattern occurs in the data stream or how many times it occurs. Once the regex matches the pattern anywhere in the text string, it passes the string along to the Linux program that is using it.

The first rule to know is that regular expression patterns are case sensitive.

$ echo "Welcome to LikeGeeks" | awk '/Geeks/{print $0}'

$ echo "Welcome to Likegeeks" | awk '/Geeks/{print $0}'

The first command prints the line because “Geeks” appears in the text exactly as written in the pattern. The second command finds no match because the text contains “geeks” in lowercase, while the pattern expects an uppercase “G”.
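
If you do want to ignore case, one possible approach (a sketch, not the only way) is to lowercase each line with awk's tolower function before testing it against a lowercase pattern. The variable name result is just for this illustration.

```shell
# Lowercase each line before testing it, so the match ignores case.
# The line itself ($0) is printed unchanged.
result=$(echo "Welcome to Likegeeks" | awk 'tolower($0) ~ /geeks/ {print $0}')
echo "$result"
```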

You also don’t have to restrict yourself to single text words in the regular expression. Patterns could include spaces or numbers like this:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'

Special characters

Regex patterns give some characters a special meaning. If you include them in a pattern as plain text, you won’t get the expected result.

These special characters are recognized by regex:

.*[]^${}\+?|()

You need to escape these special characters to match them literally.

The escape character is the backslash (\).

For example, if you want to search for a dollar sign in your text, escape it with a backslash character like this:

$ awk '/\$/{print $0}' myfile

The backslash itself is also a special character. If you need to match it in a regex pattern, escape it as well, which produces a double backslash.

$ echo "\ is a special character" | awk '/\\/{print $0}'

Although the forward slash isn’t a regular expression special character, sed and awk use it as the pattern delimiter, so using it unescaped inside a pattern still produces an error.

$ echo "3 / 2" | awk '///{print $0}'

So you need to escape it like this:

$ echo "3 / 2" | awk '/\//{print $0}'
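
In sed there is an alternative to escaping the slash: an address may start with a backslash followed by any character, and that character then becomes the delimiter. A small sketch, with # chosen arbitrarily as the delimiter and result as an illustrative variable name:

```shell
# In a sed address, \# makes # the regex delimiter, so / needs no escaping.
result=$(echo "3 / 2" | sed -n '\#/#p')
echo "$result"
```

This can be easier to read when a pattern contains many slashes, such as a file path.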

Anchor characters

To locate the beginning of a line in a text, use the caret character (^).

If the pattern is located in any place other than the start of the line, the regex pattern fails.

You can use it like this:

$ echo "welcome to likegeeks website" | awk '/^likegeeks/{print $0}'

$ echo "likegeeks website" | awk '/^likegeeks/{print $0}'

The caret character (^) checks for the pattern at the beginning of each new line of data:

$ awk '/^this/{print $0}' myfile

Awesome!! When using sed, if you put the caret character anywhere other than the beginning of the pattern, it acts as a normal character rather than a special one.

$ echo "This ^ is a test" | sed -n '/s ^/p'

When using awk, you have to escape it like this:

$ echo "This ^ is a test" | awk '/s \^/{print $0}'

That covers the beginning of the line. What about the end?

The dollar sign ($) checks for the end of a line:

$ echo "This is a test" | awk '/test$/{print $0}'

You can use both the start and end anchor on the same line like this:

$ awk '/^this is a test$/{print $0}' myfile

As you can see, it prints only the lines that consist of exactly the matching pattern.

You can filter out blank lines with the following pattern:

$ awk '!/^$/{print $0}' myfile

Here we introduce negation, which is done with the exclamation mark (!).

The pattern looks for lines that have nothing between the start and end of the line, and the negation prints only the lines that have text.
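
The same empty-line pattern also works in sed. As a rough sketch with made-up sample input, the d command deletes every line the pattern matches, which leaves only the lines that have text:

```shell
# Delete every line that has nothing between its start (^) and end ($).
result=$(printf 'first\n\nsecond\n' | sed '/^$/d')
echo "$result"
```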

The dot character

The dot character is used to match any single character except a newline character.

Look at the following example to get the idea:

$ awk '/.st/{print $0}' myfile

You can see from the result that it prints only the first two lines, because they contain st preceded by another character. The third line does not contain st at all, and the fourth line starts with st, so there is no character before it for the dot to match.
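
To try this without a file, here is a small sketch that feeds three made-up lines to the same pattern. The input words and the variable name result are assumptions for this illustration only.

```shell
# "test" and "last" have a character before "st"; the line "st" does not,
# so the dot has nothing to match there.
result=$(printf 'test\nst\nlast\n' | awk '/.st/{print $0}')
echo "$result"
```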

Character classes

You can match any character with the dot special character, but what if you want to limit which characters match? In this case, you can use a character class.

A character class matches a set of characters: if any one of them is found at that position, the pattern matches.

To define a character class, you use square brackets [] like this:

$ awk '/[oi]th/{print $0}' myfile

Here we search for th preceded by either an o or an i.

This comes in handy when you are searching for words that may start with an upper or lower case letter and you are not sure which.

$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'

$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'

Of course, character classes are not limited to letters; you can use digits or any other characters you want.

Negating character classes

You can also reverse the effect of a character class. Instead of searching for a character included in a class, you can look for any character that’s not in the class. To achieve that, type a caret character at the beginning of the character class like this:

$ awk '/[^oi]th/{print $0}' myfile

By negating the character class, the regex pattern matches th preceded by any character that’s neither o nor i.
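
One detail worth seeing in action: a negated class still has to match some character. The sketch below uses made-up input lines to show that a line starting with th does not match, because nothing precedes the th.

```shell
# both -> "o" before th (excluded), with -> "i" before th (excluded),
# math -> "a" before th (allowed), the -> no character before th at all.
result=$(printf 'both\nwith\nmath\nthe\n' | awk '/[^oi]th/{print $0}')
echo "$result"
```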

Using ranges

You can use a range of characters inside a character class by using the dash symbol like this:

$ awk '/[e-p]st/{print $0}' myfile

This matches any single character between e and p followed by st.

You can also use ranges for numbers:

$ echo "123" | awk '/[0-9][0-9][0-9]/'

$ echo "12a" | awk '/[0-9][0-9][0-9]/'

You can use multiple and non-continuous ranges in a single character class:

$ awk '/[a-fm-z]st/{print $0}' myfile

The character class allows any character in the ranges a to f and m to z to appear before the st text.

Special Character Classes

The following list includes the special character classes you can use:

[[:alpha:]]    Pattern for any alphabetical character, either upper or lower case.

[[:alnum:]]    Pattern for 0–9, A–Z, or a–z.

[[:blank:]]    Pattern for space or Tab only.

[[:digit:]]    Pattern for 0 to 9.

[[:lower:]]    Pattern for a–z lower case only.

[[:print:]]    Pattern for any printable character.

[[:punct:]]    Pattern for any punctuation character.

[[:space:]]    Pattern for any whitespace character: space, Tab, NL, FF, VT, CR.

[[:upper:]]    Pattern for A–Z upper case only.

You can use them like this:

$ echo "abc" | awk '/[[:alpha:]]/{print $0}'

$ echo "abc" | awk '/[[:digit:]]/{print $0}'

$ echo "abc123" | awk '/[[:digit:]]/{print $0}'
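
These classes combine with the anchors covered earlier. As a small sketch using sed and made-up input, the following keeps only the lines that begin with an upper case letter:

```shell
# ^ anchors the class to the start of the line, so only "Linux" is printed.
result=$(printf 'Linux\nlinux\n42\n' | sed -n '/^[[:upper:]]/p')
echo "$result"
```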

The asterisk

Putting an asterisk after a character signifies that the character may appear zero or more times in the text to match the pattern.

$ echo "test" | awk '/tes*t/{print $0}'

$ echo "tessst" | awk '/tes*t/{print $0}'

This pattern symbol is usually used for handling words that have a common misspelling or variations in language spellings.

$ echo "I like green color" | awk '/colou*r/{print $0}'

$ echo "I like green colour " | awk '/colou*r/{print $0}'

In these examples, both color and colour match, because the asterisk means the “u” character may appear zero times or many times.

Another handy feature is combining the dot character with the asterisk character. This combination gives a pattern to match any number of any characters.

$ awk '/this.*test/{print $0}' myfile

It doesn’t matter how many words appear between “this” and “test”; any line that contains both in that order is printed.

The asterisk character can also be applied to a character class.

$ echo "st" | awk '/s[ae]*t/{print $0}'

$ echo "sat" | awk '/s[ae]*t/{print $0}'

$ echo "set" | awk '/s[ae]*t/{print $0}'

All three examples match because the asterisk allows the “a” or “e” characters to appear zero or more times between “s” and “t”.

Extended Regular Expressions

The POSIX ERE patterns add a few symbols that are used by some Linux apps and utilities. The awk command recognizes ERE patterns, but sed doesn’t by default (GNU sed and BSD sed accept them when run with the -E option).

We will discuss the commonly used ERE pattern symbols that you can use in your awk program scripts.
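
For comparison, here is a sketch of how sed treats an ERE symbol with and without the -E option. It assumes a sed that supports -E, such as GNU sed or BSD sed; the variable names bre and ere are made up for this illustration.

```shell
# Without -E, sed treats + as a literal plus sign, so nothing matches.
bre=$(echo "teest" | sed -n '/te+st/p')

# With -E, + means "one or more of the preceding character".
ere=$(echo "teest" | sed -n -E '/te+st/p')

echo "BRE: '$bre'  ERE: '$ere'"
```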

The question mark

The question mark is used to indicate that the preceding character can appear zero or one time.

$ echo "tet" | awk '/tes?t/{print $0}'

$ echo "test" | awk '/tes?t/{print $0}'

$ echo "tesst" | awk '/tes?t/{print $0}'

The question mark can be used in combination with a character class:

$ echo "tst" | awk '/t[ae]?st/{print $0}'

$ echo "test" | awk '/t[ae]?st/{print $0}'

$ echo "tast" | awk '/t[ae]?st/{print $0}'

$ echo "taest" | awk '/t[ae]?st/{print $0}'

$ echo "teest" | awk '/t[ae]?st/{print $0}'

If zero or one character from the character class is present, the pattern matches.

But if both characters appear, or if one of them appears twice, the pattern fails.
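
The question mark is stricter than the asterisk, which matters for the spelling example from the asterisk section. The sketch below runs both patterns over the same made-up word list; the variable names star and question are assumptions for this illustration.

```shell
words=$(printf 'color\ncolour\ncolouur\n')

# u* accepts any number of "u" characters, including the misspelling.
star=$(echo "$words" | awk '/colou*r/{print $0}')

# u? accepts at most one "u", so "colouur" is rejected.
question=$(echo "$words" | awk '/colou?r/{print $0}')

echo "with *: $(echo $star)"
echo "with ?: $(echo $question)"
```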

The plus sign

The plus sign means that the preceding character can appear one or more times, but must be present at least once.

$ echo "test" | awk '/te+st/{print $0}'

$ echo "teest" | awk '/te+st/{print $0}'

$ echo "tst" | awk '/te+st/{print $0}'

If the “e” character is not present, the pattern will fail. The plus sign also works along with character classes, just like the asterisk and question mark.

$ echo "tst" | awk '/t[ae]+st/{print $0}'

$ echo "test" | awk '/t[ae]+st/{print $0}'

$ echo "teast" | awk '/t[ae]+st/{print $0}'

$ echo "teeast" | awk '/t[ae]+st/{print $0}'

Here, if the characters defined in the character class appear one or more times, in any combination, the text matches the specified pattern.

Curly braces

Curly braces are available in ERE to let you specify how many times the preceding regex may repeat. This is called an interval, and it has two formats:

  • {n}: The regex appears exactly n times.
  • {n,m}: The regex appears at least n times, but no more than m times.

$ echo "tst" | awk '/te{1}st/{print $0}'

$ echo "test" | awk '/te{1}st/{print $0}'

In older versions of GNU awk (gawk), you had to pass the --re-interval command line option for awk to recognize regular expression intervals, but newer versions don’t need it.

$ echo "tst" | awk '/te{1,2}st/{print $0}'

$ echo "test" | awk '/te{1,2}st/{print $0}'

$ echo "teest" | awk '/te{1,2}st/{print $0}'

$ echo "teeest" | awk '/te{1,2}st/{print $0}'

In this example, the “e” character must appear once or twice in the text to pass; otherwise, the pattern will fail.

You can apply an interval to a character class in the same way:

$ echo "tst" | awk  '/t[ae]{1,2}st/{print $0}'

$ echo "test" | awk  '/t[ae]{1,2}st/{print $0}'

$ echo "teest" | awk  '/t[ae]{1,2}st/{print $0}'

$ echo "teeast" | awk  '/t[ae]{1,2}st/{print $0}'

This regex pattern matches if there are one or two instances of the letters “a” or “e” between “t” and “st”, but it fails if there are more, in any combination.
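
Intervals also replace repeated classes such as the [0-9][0-9][0-9] pattern used earlier. The sketch below uses grep -E rather than awk, because some awk implementations do not recognize intervals. It also shows a third interval form that POSIX ERE defines, {n,}, meaning at least n times. The sample numbers and variable names are made up for this illustration.

```shell
# {3} requires exactly three digits between the anchors.
exact=$(printf '12\n123\n1234\n' | grep -E '^[0-9]{3}$')

# {3,} means at least three digits, with no upper limit.
at_least=$(printf '12\n123\n1234\n' | grep -E '^[0-9]{3,}$')

echo "exactly three: $(echo $exact)"
echo "three or more: $(echo $at_least)"
```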

The pipe symbol

The pipe symbol allows you to specify two or more patterns that the regex engine combines with a logical OR when examining the data stream. If any of the patterns matches the text, the pattern passes. If none of them matches, the pattern fails. Here is an example:

$ echo "This is a test" | awk '/test|exam/{print $0}'

$ echo "This is an exam" | awk '/test|exam/{print $0}'

$ echo "This is something else" | awk '/test|exam/{print $0}'

This example looks for “test” or “exam” in the text. Keep in mind that any space you place between the regular expressions and the pipe symbol becomes part of the pattern, so don’t add spaces there unless you want to match them.
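
A quick sketch of what goes wrong with spaces around the pipe, using the same sample sentence. The variable names are made up for this illustration.

```shell
# The spaces are part of each alternative: "test " or " exam".
# The line ends right after "test", so neither alternative is found.
with_spaces=$(echo "This is a test" | awk '/test | exam/{print $0}')

# Without spaces, the alternatives are exactly "test" or "exam".
without_spaces=$(echo "This is a test" | awk '/test|exam/{print $0}')

echo "with spaces: '$with_spaces'"
echo "without spaces: '$without_spaces'"
```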

Grouping expressions

Regex patterns can also be grouped by using parentheses. When you group a regex pattern, the group is treated like a single character, so you can apply a special character such as ?, *, or + to the whole group just as you would to a regular character.

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'

$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'

The grouping of the “Geeks” ending with the question mark allows the pattern to match either the full name “LikeGeeks” or the word “Like” only.
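
Grouping is also useful with the pipe symbol, because the parentheses limit how far the alternation reaches. In this sketch (the day names are just sample data), only the part inside the group varies, while the anchors and the “day” suffix apply to both alternatives:

```shell
# Matches "Saturday" or "Sunday", but not "Monday".
result=$(printf 'Saturday\nSunday\nMonday\n' | awk '/^(Satur|Sun)day$/{print $0}')
echo "$result"
```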

Practical examples

We have seen some simple demonstrations of regular expression patterns. It’s time to put them into action, just for practice.

Counting directory files

Let’s look at a bash script that counts the executable files that are available in the directories defined in your PATH environment variable.

$ echo $PATH

To get a list of directories, replace each colon with a space.

$ echo $PATH | sed 's/:/ /g'

Now let’s iterate through each directory using the for loop like this:
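
A minimal sketch of such a loop follows. The variable names mypath and directory are chosen for this illustration, and the loop assumes that no directory in PATH contains spaces.

```shell
# Replace each colon in PATH with a space.
mypath=$(echo "$PATH" | sed 's/:/ /g')

# Unquoted $mypath is split on spaces, giving one directory per iteration.
for directory in $mypath; do
    echo "$directory"
done
```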

Great!!

Now we can use the ls command to list files in each directory and save the count in a variable
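
One possible version of the full script is sketched below, under the same assumptions as before (made-up variable names, no spaces in the PATH directories). It counts entries with wc -l and discards the error messages that ls prints for directories that don’t exist.

```shell
# Replace each colon in PATH with a space.
mypath=$(echo "$PATH" | sed 's/:/ /g')
total=0

for directory in $mypath; do
    # ls prints one entry per line when writing to a pipe;
    # wc -l counts those lines. A missing directory counts as 0.
    count=$(ls "$directory" 2>/dev/null | wc -l)
    echo "$directory - $((count))"
    total=$((total + count))
done

echo "total - $total"
```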

You may notice that some directories don’t exist; that is not a problem.

Cool!! This is the power of regex. These few lines of code count the files in every directory in your PATH. Of course, there are Linux commands that do this more easily, but the point here is to show how to employ regex in something practical. You can come up with more useful ideas.

Validating e-mail address

There are a ton of websites that offer ready-to-use regex patterns for everything, including e-mail addresses, phone numbers, and much more. This is handy, but we want to understand how such a pattern works. An e-mail address has this general form:

username@hostname.com

The username can use any alphanumeric characters combined with dot, dash, plus sign, underscore.

The hostname can use any alphanumeric characters combined with dot, dash, and underscore.

Let’s start formulating our regular expression pattern from the left side. We know that you may have multiple valid characters in the username. This should be very easy.

^([a-zA-Z0-9_\-\.\+]+)@

This grouping specifies the allowed characters in the username. The plus sign indicates that at least one character must be present, and the @ sign follows.

Then the hostname pattern should be like this:

([a-zA-Z0-9_\-\.]+)

There are special rules for the TLDs, or top-level domains. This tutorial assumes they are no less than two characters (as used in country codes) and no more than five characters in length; some real TLDs are longer, so raise the upper limit if you need to accept them. The following is the regex pattern for the top-level domain.

\.([a-zA-Z]{2,5})$

Now we put them all together:

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$

Let’s test that regex against an email:

$ echo "name@host.com" | awk '/^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$/{print $0}'

$ echo "name@host.com.us" | awk '/^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$/{print $0}'
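
To reuse the check in a script, you could wrap it in a function. The sketch below is one way to do it with grep -E; the function name is_valid_email and the test addresses are made up for this illustration. The character classes are written without backslashes here, because inside square brackets the dot and plus are already literal in POSIX tools, and a dash placed last is literal too.

```shell
# is_valid_email is a hypothetical helper name used only in this sketch.
# It returns success (0) when its argument matches the pattern.
is_valid_email() {
    printf '%s\n' "$1" |
        grep -E -q '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}$'
}

for address in "name@host.com" "name@host.com.us" "name@host" "@host.com"; do
    if is_valid_email "$address"; then
        echo "$address: valid"
    else
        echo "$address: invalid"
    fi
done
```

The last two addresses are rejected: one has no top-level domain, and the other has an empty username.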

Awesome!! Works great.

This is just the beginning of the regex world, which never ends. I hope that after this post you understand these ASCII pukes and can use them more professionally.

I hope you like the post.

Thank you.