Split Column Into Multiple Columns Using awk split()

The split function in awk allows you to split a string into array elements based on a specified delimiter.

In this tutorial, we’ll explore various methods to split columns using awk split function, including handling different delimiters, using regular expressions, conditional splitting, and rearranging the split columns.

 

 

Split First/Last Column

Consider the following dataset:

A1,B1,C1
A2,B2,C2
A3,B3,C3

Split the First Column

To split the first column, use this awk command:

awk -F, '{split($1, a, /[0-9]+/); print a[1], substr($1, length(a[1])+1), $2, $3}' data.txt

Output:

A 1 B1 C1
A 2 B2 C2
A 3 B3 C3

Here, the split function divides the first field ($1) into array a, using /[0-9]+/ regex which indicates the number as the delimiter.

Split the Last Column

For splitting the last column, the command will be:

awk 'BEGIN {FS=","; OFS=" "} {split($3, arr, ""); $3=""; $4=arr[1]; $5=arr[2]; print}' data.txt

Output:

A1 B1 C 1
A2 B2 C 2
A3 B3 C 3

The split($3, arr, "") function splits the third column ($3) into individual characters and stores them in the array arr.

The subsequent lines $3=""; $4=arr[1]; $5=arr[2] remove the original content of the third column and assign the first and second elements of the arr array to the fourth and fifth columns, respectively.

 

Split All Columns

Consider a dataset like this:

A1-B1,C1-D1,E1-F1
A2-B2,C2-D2,E2-F2
A3-B3,C3-D3,E3-F3

To split each column at the hyphen -, you can use the following awk command:

awk -F, '{
    split($1, a, "-");
    split($2, b, "-");
    split($3, c, "-");
    print a[1], a[2], b[1], b[2], c[1], c[2]
}' data.txt

Output:

A1 B1 C1 D1 E1 F1
A2 B2 C2 D2 E2 F2
A3 B3 C3 D3 E3 F3

In this command, -F, sets the field separator as a comma for the initial column separation.

Then, the split function is applied to each field ($1, $2, $3), using - as the delimiter.

 

Split Pipe-Separated Columns

For a pipe-separated dataset like:

A1|B1|C1
A2|B2|C2
A3|B3|C3

Set the field separator to a pipe:

awk -F'|' '{
    split($1, a, "-");
    split($2, b, "-");
    print a[1], a[2], b[1], b[2], $3
}' data.txt

Output:

A1 B1 C1
A2 B2 C2
A3 B3 C3

 

Split Columns Based on Regular Expression

Consider a dataset like this:

A1X23B4
C5Y67D8
E9Z01F2

To split each line where a digit changes to a letter or vice versa, you can use awk with a regex pattern:

awk '{
    line = $0;
    new_line = "";
    for (i = 1; i <= length(line); i++) { char = substr(line, i, 1); if (i > 1) {
            prev_char = substr(line, i-1, 1);
            if ((char ~ /[A-Za-z]/) != (prev_char ~ /[A-Za-z]/) || (char ~ /[0-9]/) != (prev_char ~ /[0-9]/)) {
                new_line = new_line " " char;
            } else {
                new_line = new_line char;
            }
        } else {
            new_line = new_line char;
        }
    }
    # Split the modified line based on spaces and print each element
    n = split(new_line, a, " ");
    for (i = 1; i <= n; i++) {
        printf("%s ", a[i]);
    }
    printf("\n");
}' data.txt

Output:

A 1 X 23 B 4 
C 5 Y 67 D 8 
E 9 Z 01 F 2 

This command iterates over each character in the input line and checks if the transition from a digit to a letter or vice versa occurs by comparing the current character with the previous one.

If a transition occurs, it inserts a space before the current character.
Then it appends the current character (with a possible leading space) to the new line.

After processing all characters, it splits the new line based on spaces using the split function and stores the results in the array a.

The loop iterates over the array a, printing each element followed by a space.

 

Conditional Splitting

Suppose you have a dataset like this:

A1,100
B2,200
C3,150
D4,300

Suppose you want to split the first column if the second column’s value is greater than 150, otherwise, print the whole line as it is.

Here’s how to do it using awk:

awk -F, '{
    if ($2 > 150) {
        split($1, arr, "");
        print arr[1] "," arr[2] "," $2
    } else {
        print $0
    }
}' data.txt

Output:

A1,100
B,2,200
C3,150
D,4,300

In this awk script, -F, sets the field separator to a comma.

split($1, arr, "") still splits the first field ($1) into an array arr using an empty string as the separator.

print arr[1] "," arr[2] "," $2 prints both the first and second characters of the first column, followed by the second column’s value, separated by commas.

print $0 If the condition is not met, it prints the whole line as it is.

 

Split Columns Based on Multiple Field Separators

Imagine a dataset like this:

A1 B1,C1 D1-E1
A2 B2,C2 D2-E2
A3 B3,C3 D3-E3

Suppose you want to split the data using commas, spaces, and hyphens as field separators.

Use the following awk command:

awk -F'[ ,\-]' '{
    for (i = 1; i <= NF; i++) {
        printf("%s ", $i);
    }
    printf("\n");
}' data.txt

Output:

A1 B1 C1 D1 E1 
A2 B2 C2 D2 E2 
A3 B3 C3 D3 E3 

The command -F'[ ,\-]' sets the field separator to a regular expression that includes a space, comma, and hyphen.

The script then iterates through each field ($i) in the line and prints each one followed by a space for clarity.

 

Rearrange Splitted Columns

Imagine a dataset like this:

A1,B1,C1
A2,B2,C2
A3,B3,C3

To split these columns and then rearrange them so that the third part becomes the first, the first becomes the second, and the second becomes the third, you can use awk like this:

awk -F, '{
    split($0, a, ",");
    print a[3], a[1], a[2]
}' data.txt

Output:

C1 A1 B1
C2 A2 B2
C3 A3 B3

In this script, -F, sets the field separator as a comma.

The split function is used to divide the entire line ($0) into an array a.

Then, the print statement rearranges and prints the elements of the array in the desired order: third, first, then second.

Leave a Reply

Your email address will not be published. Required fields are marked *