Sed Grouping and Backreferences for Text Processing
Grouping and backreferencing are fundamental concepts in regular expressions, and sed provides support for both.
Grouping allows you to treat multiple characters as a single unit, and backreferences allow you to refer to previously matched groups.
In this tutorial, we will cover various topics related to grouping and backreferencing in sed
, such as capturing groups, alternation inside groups, and more.
- 1 Capturing Groups and Backreferences
- 2 Using Braces {} for Repetition in sed Groups
- 3 Using Alternation Inside Groups (|)
- 4 Escaping Parentheses and Metacharacters in Groups
- 5 Limitations and Edge Cases with Backreferences
- 6 Common Mistakes While Using Groups and Backreferences
- 7 Efficient Log Processing Using sed
Capturing Groups and Backreferences
Suppose you have a file called names.txt
with the following content:
John Doe Jane Doe
And you want to change the order of the names to last name, first name. You can use the sed
command with capturing groups and backreferences to accomplish this.
Command:
sed 's/\(.*\) \(.*\)/\2, \1/' names.txt
Output:
Doe, John Doe, Jane
In this command, the search pattern \(.*\) \(.*\)
is used to match the whole line and capture the first and last names. The first \(.*\)
captures the first name, and the second \(.*\)
captures the last name.
The replacement pattern \2, \1
is used to replace the matched text.
The \1
is a backreference to the first captured group, and \2
is a backreference to the second captured group.
So, the replacement pattern swaps the first and last names and adds a comma in between.
Using Braces {} for Repetition in sed Groups
The braces {}
in sed
are used to specify the repetition of a character or a group of characters.
You can use them to match a specific number of repetitions of a character or a group.
For example, suppose you have a file called numbers.txt
with the following content:
111 222 333
And you want to change the 111
to 1
, 222
to 2
, and 333
to 3
. You can use the sed
command with braces {}
to specify the repetition.
Command:
sed 's/\(1\)\1\{2\}/\1/' numbers.txt
Output:
1 222 333
In this command, the search pattern \(1\)\1\{2\}
is used to match 111
in the input. The \(1\)
captures the first 1
, and the \1\{2\}
matches the next two 1
s.
The replacement pattern \1
is used to replace the matched text with the first captured group, which is 1
.
As a result, 111
is replaced with 1
in the output.
You can run the command again with 2
and 3
to change 222
to 2
and 333
to 3
.
Using Alternation Inside Groups (|)
Alternation in sed
is used to match this OR that. You can use the pipe symbol |
inside groups ()
to specify alternation.
Note that you need to use the -E
option to enable extended regular expressions in sed
for this to work.
For example, let’s say you have a file called days.txt
with the following content:
Monday Tuesday Wednesday
And you want to replace Monday
or Wednesday
with Holiday
. You can use the sed
command with alternation inside groups to accomplish this.
Command:
sed -E 's/(Monday|Wednesday)/Holiday/' days.txt
Output:
Holiday Tuesday Holiday
The search pattern (Monday|Wednesday)
is used to match Monday
or Wednesday
in the input.
The replacement pattern Holiday
is used to replace the matched text.
Escaping Parentheses and Metacharacters in Groups
In sed
, parentheses ()
are used for grouping, but if you want to match literal parentheses or other metacharacters in the input, you need to escape them with a backslash \
.
For example, let’s say you have a file called data.txt
with the following content:
(abc) def (ghi)
And you want to remove the parentheses. You can use the sed
command with escaped parentheses to accomplish this.
Command:
sed 's/(/[/g; s/)/]/g' data.txt
Output:
[abc] def [ghi]
In this example, the sed
command replaces all occurrences of (
with [
and all occurrences of )
with ]
.
However, if you are using sed
with the -E
option (which enables extended regular expressions), the parentheses (
and )
are treated as special characters by default, and you must escape them to treat them as literal characters:
sed -E 's/\(/[/g; s/\)/]/g' data.txt
This command will produce the same output as the previous example.
Limitations and Edge Cases with Backreferences
Backreferences in sed
are powerful, but there are some limitations and edge cases you should be aware of:
- Limited Number of Backreferences: In
sed
, you can only use up to 9 backreferences (\1
to\9
) in the replacement pattern. If you have more than 9 captured groups in the search pattern, you can only reference the first 9 in the replacement pattern. - No Support for Lookahead and Lookbehind Assertions: As mentioned earlier,
sed
does not support lookahead and lookbehind assertions, which can limit the use of backreferences in some cases. - No Support for Nested Capturing Groups:
sed
does not support nested capturing groups, which can make it difficult to work with backreferences in some cases.
Common Mistakes While Using Groups and Backreferences
Using groups and backreferences in sed
can be tricky, and there are some common mistakes that people often make:
- Not Escaping Special Characters: In Basic Regular Expressions (BRE), which is the default mode of
sed
, special characters like(
,)
,{
, and}
must be escaped with a backslash\
to be treated as group delimiters or repetition operators. For example,\(
and\)
are used for grouping, and\{m,n\}
is used for repetition. - Using Wrong Backreference: The backreferences
\1
to\9
refer to the captured groups in the order they appear in the search pattern. It is a common mistake to use the wrong backreference in the replacement pattern. - Confusing Grouping with Capturing: All capturing groups
( ... )
are also grouping constructs, but not all grouping constructs are capturing groups. For example,(?: ... )
is a non-capturing group that groups the enclosed tokens but does not capture the matched text. - Not Considering Line Breaks: By default,
sed
processes the input line by line, so the^
and$
anchors match the start and end of a line, not the start and end of the entire input. If you need to work with multiple lines, you need to use theN
command to append the next line of input to the pattern space. - Using
*
Instead of.*
:*
matches zero or more of the preceding token, while.*
matches zero or more of any character (except a newline). It is a common mistake to use*
instead of.*
in the search pattern.
Efficient Log Processing Using sed
One particular freelance job I had was for a client in the US who had a massive log file that needed to be processed and reformatted.
The log file contained around 10 million lines of text, with each line having a set of values separated by a delimiter.
The client needed to rearrange some parts of the text, swap some values, and remove redundant entries.
I tried to use Python to process the file, but due to the massive size of the file, it was taking too long to process.
I then realized that the sed command in Linux would be a perfect fit for this task as it is designed for text processing and works directly on the file without having to load the entire file into memory.
Grouping allows you to group together a section of a regular expression, and back-references allow you to refer back to the grouped text.
The lines were like this:
timestamp|IP|URL|status
And the client wanted it to be rearranged to:
IP|timestamp|status|URL
I could use the sed command with grouping and back-references to easily rearrange the text:
sed 's/\(.*\)|\(.*\)|\(.*\)|\(.*\)/\2|\1|\4|\3/' input.log > output.log
In this command, the s tells sed
to perform a substitution, and the () are used to group parts of the text. The ‘\1’, ‘\2’, etc., are back-references that refer to the grouped parts of the text.
Using sed
with grouping and back-references, I was able to process the 10 million line file in under 30 minutes, while the Python script I initially wrote was only able to process around 10% of the file in the same amount of time.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.