Linux AWK Index Function: Find Substrings
The index function in awk allows you to find the position of a substring within a string.
In this tutorial, we’ll explore various aspects of the awk index function.
We’ll begin with its syntax and usage, move on to handling case sensitivity and special characters, and learn how to find multiple occurrences of a substring.
Lastly, we will discuss validating user input using the awk index function.
Syntax and Usage
The syntax is as follows:
index(string, substring)
Here, string is the text you are searching in, and substring is the text you are searching for.
The function returns the position of the first occurrence of substring in string. Positions are counted from 1.
If the substring is not found, it returns 0.
Imagine you have a data file data.txt with the following content:
1,John Doe,New York
2,Jane Smith,California
3,Emily Davis,Texas
To find the position of the name “Smith” in the second line, you can use the awk command with the index function:
awk -F, '{print $2, index($2, "Smith")}' data.txt
Output:
John Doe 0
Jane Smith 6
Emily Davis 0
In this output, the command prints each name in the file along with the position of “Smith”.
In the second line, “Smith” begins at the 6th position in “Jane Smith”, while in other lines, as “Smith” is not present, the function returns 0.
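Because index returns 0 when nothing is found, its result can also serve as a condition. The sketch below recreates the three sample rows with printf instead of reading data.txt, and prints only the records whose name field contains “Smith”. The shell variable out is just an illustrative name.

```shell
# Recreate the three sample rows and keep only those whose
# second field contains "Smith" (index returns a non-zero position).
out=$(printf '%s\n' '1,John Doe,New York' '2,Jane Smith,California' '3,Emily Davis,Texas' |
  awk -F, 'index($2, "Smith") > 0 { print $1 ": " $2 }')
printf '%s\n' "$out"
```

Only the second record passes the test, so this prints a single line for Jane Smith.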
Case Sensitivity in the Index Function
By default, the index function in awk is case-sensitive. This means that it distinguishes between uppercase and lowercase letters, so searching for “smith” will not match “Smith”.
To search regardless of case, a common method is to convert both the string and the substring to the same case, either upper or lower, using the toupper() or tolower() functions in awk.
For example, to find the position of “smith” in the name field without regard to case, lowercase the field before searching. The search term here is already lowercase, so only the field needs converting:
awk -F, '{print $2, index(tolower($2), "smith")}' data.txt
Output:
John Doe 0
Jane Smith 6
Emily Davis 0
The tolower($2) call converts each name to lowercase, and the index function then searches for “smith”.
As a result, “Jane Smith” matches “smith” at position 6, despite the difference in case.
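When the search term is not a fixed lowercase literal, both sides need converting. The following is a minimal sketch, assuming the term arrives in a variable named term (a hypothetical name) passed with -v; the sample rows are recreated with printf.

```shell
# "term" is a hypothetical variable supplied with -v; lowercasing both
# the field and the term makes the comparison ignore case.
out=$(printf '%s\n' '1,John Doe,New York' '2,Jane Smith,California' '3,Emily Davis,Texas' |
  awk -F, -v term="SMITH" '{ print $2, index(tolower($2), tolower(term)) }')
printf '%s\n' "$out"
```

The reported positions are the same as with the literal “smith”, because tolower does not change the length of the string.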
Deal with Special Characters
The index function performs a plain, character-by-character string comparison, not a regular expression match.
Characters such as @, #, &, ., or * therefore have no special meaning in the substring and do not need to be escaped.
The only escaping that applies is that of awk string literals themselves: a literal backslash is written as \\ and a double quote as \".
Consider a modified version of the data.txt file:
1,John Doe#New York
2,Jane Smith@California
3,Emily&Davis,Texas
To find the position of @, pass it to index as is:
awk -F, '{print $2, index($2, "@")}' data.txt
Output:
John Doe#New York 0
Jane Smith@California 11
Emily&Davis 0
In this output, the index function returns 11 for the second line, indicating the position of the @ symbol in “Jane Smith@California”.
Writing "\@" instead is not needed and is best avoided: \@ is not a defined escape sequence in awk, and gawk, for example, prints a warning and treats it as a plain @.
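One way to check how index treats characters that are special in regular expressions is to compare it with match, which does take a regular expression. The sketch below uses a made-up sample string, a.b*c@d, chosen only for illustration.

```shell
out=$(awk 'BEGIN {
  s = "a.b*c@d"
  # index compares characters literally, so ".", "*" and "@" are found as written
  print index(s, "."), index(s, "*"), index(s, "@")
  # match treats "." as a regex wildcard, so it matches the very first character
  print match(s, /./)
}')
printf '%s\n' "$out"
```

The first line of output gives positions 2, 4, and 6, while match reports 1, which shows that only the regular expression function gives the dot a special meaning.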
Finding Multiple Occurrences
To find multiple occurrences, you can set up a loop that continues to search the string from the point just after the last found occurrence.
In each iteration, the starting point of the search moves forward so you can find all instances of the substring.
Suppose you have the following entry in a file, data.txt:
The quick brown fox jumps over the lazy dog. The fox is quick and brown.
You need to find all occurrences of the word “fox”. Here’s how you can do this with awk:
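awk '
{
    line = $0
    search_term = "fox"
    offset = 0    # number of characters already removed from the front of line
    while ((pos = index(line, search_term)) > 0) {
        # pos is relative to the remaining text; add offset for the absolute position
        print "Found \"" search_term "\" at position", offset + pos
        # Skip past this match and keep searching the remainder
        offset += pos + length(search_term) - 1
        line = substr(line, pos + length(search_term))
    }
}
' data.txt
Output:
Found "fox" at position 17
Found "fox" at position 50
In this output, the awk script finds “fox” at positions 17 and 50, both counted from the start of the original line.
Because each search runs on the remainder of the line, the script keeps a running offset and adds it to the value returned by index. Without the offset, the second match would be reported as 31, which is its position within the remainder rather than within the full line.
The loop continues until index returns 0, which means no more occurrences are found.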
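The same search-and-advance loop can be wrapped in a function when only the number of matches is needed. This is a minimal sketch: count_occurrences is a hypothetical helper name, it counts non-overlapping matches, and the sample sentence is fed in with printf rather than read from data.txt.

```shell
out=$(printf '%s\n' 'The quick brown fox jumps over the lazy dog. The fox is quick and brown.' |
  awk '
  # count_occurrences is a hypothetical helper: it returns how many
  # non-overlapping times needle appears in text. n and pos are locals.
  function count_occurrences(text, needle,    n, pos) {
      if (needle == "") return 0          # guard against an empty search term
      n = 0
      while ((pos = index(text, needle)) > 0) {
          n++
          text = substr(text, pos + length(needle))   # continue after the match
      }
      return n
  }
  { print count_occurrences($0, "fox"), count_occurrences($0, "quick"), count_occurrences($0, "cat") }')
printf '%s\n' "$out"
```

For the sample sentence this reports two matches each for “fox” and “quick” and none for “cat”. To count overlapping matches instead, the substr call would advance by pos + 1 rather than by the length of the search term.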
Validate User Input
Validating input means you want to confirm the presence or absence of certain strings or patterns.
Suppose you want to ensure that user input does not contain certain prohibited keywords, for example, “error” or “fail”.
Let’s say you have user input in a file feedback.txt:
Network connection is excellent
Encountered an error in connection
Satisfied with the service
You can use awk to check each line for prohibited words:
awk '{if (index($0, "error") > 0 || index($0, "fail") > 0) print "Invalid input:", $0; else print "Valid input:", $0;}' feedback.txt
Output:
Valid input: Network connection is excellent
Invalid input: Encountered an error in connection
Valid input: Satisfied with the service
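Here, awk scans each line of the input, and with the index function, it checks for the presence of “error” or “fail”.
If either is found, it marks the input as invalid; otherwise, it’s considered valid.
Because index matches substrings rather than whole words, longer words such as “errors” or “failure” are flagged as well.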
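As the list of prohibited keywords grows, chaining index calls with || becomes awkward. One possible variation, sketched below, keeps the keywords in a comma-separated variable (here called banned, a hypothetical name) and also lowercases each line so that “ERROR” is caught. The three input lines are invented for the example and differ from feedback.txt.

```shell
out=$(printf '%s\n' 'Network connection is excellent' 'Encountered an ERROR in connection' 'Login failed twice' |
  awk -v banned="error,fail" '
  # "banned" is a hypothetical comma-separated keyword list passed with -v
  BEGIN { n = split(banned, words, ",") }
  {
      status = "Valid input:"
      text = tolower($0)                  # compare without regard to case
      for (i = 1; i <= n; i++)
          if (index(text, words[i]) > 0) { status = "Invalid input:"; break }
      print status, $0
  }')
printf '%s\n' "$out"
```

The first line is reported as valid and the other two as invalid: one contains “ERROR” in uppercase, and “failed” contains the keyword “fail”.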
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.