Linux AWK Index Function: Find Substrings

The index function in awk allows you to find the position of a substring within a string.

In this tutorial, we’ll explore various aspects of the awk index function.

We’ll begin with its syntax and usage, move on to handling case sensitivity and special characters, and learn how to find multiple occurrences of a substring.

Lastly, we will discuss validating user input using the awk index function.



Syntax and Usage

The syntax is as follows:

index(string, substring)

Here, string is the text you are searching in, and substring is the text you are searching for.

The function returns the position of the first occurrence of substring in string.

If the substring is not found, it returns 0.
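Both return values can be checked quickly without any input file. The following is a minimal sketch that runs index inside a BEGIN block; the sample strings and the shell variable result are illustrative assumptions, not part of the original tutorial.

```shell
# Run index on fixed strings; BEGIN means awk needs no input file.
result=$(awk 'BEGIN {
  print index("hello world", "world")   # substring present: prints its start position
  print index("hello world", "xyz")     # substring absent: prints 0
}')
echo "$result"
```

Positions are counted from 1, so "world" is reported at position 7, and the missing substring yields 0.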

Imagine you have a data file data.txt with the following content:

1,John Doe,New York
2,Jane Smith,California
3,Emily Davis,Texas

To find the position of the name “Smith” in the name field of each line, you can use the awk command with the index function:

awk -F, '{print $2, index($2, "Smith")}' data.txt


John Doe 0
Jane Smith 6
Emily Davis 0

In this output, the command prints each name in the file along with the position of “Smith”.

In the second line, “Smith” begins at the 6th position in “Jane Smith”, while in other lines, as “Smith” is not present, the function returns 0.
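Because index returns 0 when nothing is found, its result can also serve as a filter condition. The sketch below feeds the same sample records through a pipe instead of data.txt so that it is self-contained; the shell variable matches is an assumption made for illustration.

```shell
# Print only the records whose name field contains "Smith".
matches=$(printf '%s\n' \
  '1,John Doe,New York' \
  '2,Jane Smith,California' \
  '3,Emily Davis,Texas' |
  awk -F, 'index($2, "Smith") > 0 { print $1, $2 }')
echo "$matches"
```

Only the second record passes the condition, so the output is "2 Jane Smith".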


Case Sensitivity in the Index Function

By default, the index function in awk is case-sensitive. This means that it distinguishes between uppercase and lowercase letters.

To search without regard to case, convert both the string and the substring to the same case, either upper or lower, using the toupper() or tolower() functions in awk.

For example, to find “smith” no matter how it is capitalized in the data, convert the name field to lowercase and search for the lowercase term.

Here’s how you would do it using awk:

awk -F, '{print $2, index(tolower($2), "smith")}' data.txt


John Doe 0
Jane Smith 6
Emily Davis 0

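The tolower($2) call converts each name to lowercase, and then the index function searches the result for “smith”. The first value printed is still the original $2, so the names keep their capitalization in the output.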

As a result, “Jane Smith” matches “smith” at position 6, despite the difference in case.
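When the search term comes from outside the script, it may not be lowercase either. A possible approach, sketched below, is to pass it in with -v and lowercase both sides; the variable name term and the shell variable result are hypothetical choices for this example.

```shell
# Lowercase both the field and the search term so their case never matters.
result=$(printf '%s\n' \
  '1,John Doe,New York' \
  '2,Jane Smith,California' \
  '3,Emily Davis,Texas' |
  awk -F, -v term="SMITH" '{ print $2, index(tolower($2), tolower(term)) }')
echo "$result"
```

Even though the term is given in uppercase, “Jane Smith” is still matched at position 6.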


Deal with Special Characters

The index function treats both of its arguments as plain strings, not as regular expressions.

This means that characters such as @, #, . or * have no special meaning in the search and do not need to be escaped.

The only characters that need a backslash are the ones that are special inside any awk string literal: the double quote (\") and the backslash itself (\\).

Consider a modified version of the data.txt file:

1,John Doe#New York
2,Jane Smith@California

If you want to find the position of @, you can pass it to index as-is:

awk -F, '{print $2, index($2, "@")}' data.txt


John Doe#New York 0
Jane Smith@California 11

In this output, the index function returns 11 for the second line, indicating the position of the @ symbol in “Jane Smith@California”.

Writing "\@" instead is unnecessary, and gawk, for example, prints a warning about the unknown escape sequence before treating it as a plain @.
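Since index compares plain strings rather than regular expressions, symbols that are special in a regex can be searched for directly. The sketch below checks this on a few made-up strings (all sample values and the variable result are assumptions); only the backslash is doubled, because that is how a literal backslash is written in an awk string.

```shell
# Search for symbols that would be special in a regular expression.
result=$(awk 'BEGIN {
  print index("a.b*c", "*")              # literal asterisk, no escaping
  print index("user@example.com", "@")   # literal at sign, no escaping
  print index("C:\\temp", "\\")          # one literal backslash, written as \\
}')
echo "$result"
```

The three searches report positions 4, 5 and 3 respectively.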


Finding Multiple Occurrences

To find multiple occurrences, you can set up a loop that continues to search the string from the point just after the last found occurrence.

In each iteration, the starting point of the search moves forward so you can find all instances of the substring.

Suppose you have the following entry in a file, data.txt:

The quick brown fox jumps over the lazy dog. The fox is quick and brown.

You need to find all occurrences of the word “fox”. Here’s how you can do this with awk:

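awk '{
  line = $0
  search_term = "fox"
  offset = 0  # number of characters already cut from the front of line
  while ((pos = index(line, search_term)) > 0) {
    print "Found \"" search_term "\" at position", offset + pos
    # Skip past this match and remember how far we have advanced
    offset += pos + length(search_term) - 1
    line = substr(line, pos + length(search_term))
  }
}' data.txt


Found "fox" at position 17
Found "fox" at position 50

In this output, the awk script finds “fox” at positions 17 and 50, both counted from the start of the original line.

Because each search runs on the shortened remainder of the line, the script adds offset to the result of index to report the position in the full line.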

The loop continues until index returns 0, which means no more occurrences are found.
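The same loop can be wrapped in a function when only the number of matches is needed. This is a minimal sketch: count_occurrences is a hypothetical helper name, the extra parameters pos and n are just local variables, and matches are counted without overlap.

```shell
# Count non-overlapping occurrences of a term in each input line.
count=$(printf '%s\n' \
  'The quick brown fox jumps over the lazy dog. The fox is quick and brown.' |
  awk '
  # count_occurrences is a hypothetical helper, not a built-in awk function
  function count_occurrences(s, t,    pos, n) {
    n = 0
    if (t == "") return 0            # an empty search term never matches here
    while ((pos = index(s, t)) > 0) {
      n++
      s = substr(s, pos + length(t))  # continue after the current match
    }
    return n
  }
  { print count_occurrences($0, "fox") }')
echo "$count"
```

For the sample sentence, the function returns 2.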


Validate User Input

Validating input often means confirming the presence or absence of certain strings or patterns.

Suppose you want to ensure that user input does not contain certain prohibited keywords, for example, “error” or “fail”.

Let’s say you have user input in a file feedback.txt:

Network connection is excellent
Encountered an error in connection
Satisfied with the service

You can use awk to check each line for prohibited words:

awk '{if (index($0, "error") > 0 || index($0, "fail") > 0) print "Invalid input:", $0; else print "Valid input:", $0;}' feedback.txt


Valid input: Network connection is excellent
Invalid input: Encountered an error in connection
Valid input: Satisfied with the service

Here, awk scans each line of the input, and with the index function, it checks for the presence of “error” or “fail”.

If found, it marks the input as invalid; otherwise, it’s considered valid.
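Keep in mind that index is case-sensitive, so a line containing “ERROR” would pass the check above. One way to tighten it, sketched here under assumptions, is to lowercase each line and loop over a list of keywords; the array name banned, the variable report, and the sample lines are hypothetical.

```shell
# Flag any line that contains a banned keyword, ignoring case.
report=$(printf '%s\n' \
  'Network connection is excellent' \
  'Encountered an ERROR in connection' \
  'Login failed twice' |
  awk 'BEGIN { n = split("error fail", banned, " ") }   # banned: hypothetical keyword list
  {
    status = "Valid input:"
    lower = tolower($0)                # compare in lowercase only
    for (i = 1; i <= n; i++)
      if (index(lower, banned[i]) > 0) { status = "Invalid input:"; break }
    print status, $0
  }')
echo "$report"
```

With this input, the first line is reported as valid and the other two as invalid. Note that “failed” is flagged because it contains “fail”, since index matches substrings rather than whole words.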
