Split Column Into Multiple Columns Using awk split()
The split
function in awk allows you to split a string into array elements based on a specified delimiter.
In this tutorial, we’ll explore various methods to split columns using awk
split
function, including handling different delimiters, using regular expressions, conditional splitting, and rearranging the split columns.
Split First/Last Column
Consider the following dataset:
A1,B1,C1 A2,B2,C2 A3,B3,C3
Split the First Column
To split the first column, use this awk
command:
awk -F, '{split($1, a, /[0-9]+/); print a[1], substr($1, length(a[1])+1), $2, $3}' data.txt
Output:
A 1 B1 C1 A 2 B2 C2 A 3 B3 C3
Here, the split
function divides the first field ($1
) into array a
, using /[0-9]+/
regex which indicates the number as the delimiter.
Split the Last Column
For splitting the last column, the command will be:
awk 'BEGIN {FS=","; OFS=" "} {split($3, arr, ""); $3=""; $4=arr[1]; $5=arr[2]; print}' data.txt
Output:
A1 B1 C 1 A2 B2 C 2 A3 B3 C 3
The split($3, arr, "")
function splits the third column ($3
) into individual characters and stores them in the array arr
.
The subsequent lines $3=""; $4=arr[1]; $5=arr[2]
remove the original content of the third column and assign the first and second elements of the arr
array to the fourth and fifth columns, respectively.
Split All Columns
Consider a dataset like this:
A1-B1,C1-D1,E1-F1 A2-B2,C2-D2,E2-F2 A3-B3,C3-D3,E3-F3
To split each column at the hyphen -
, you can use the following awk
command:
awk -F, '{ split($1, a, "-"); split($2, b, "-"); split($3, c, "-"); print a[1], a[2], b[1], b[2], c[1], c[2] }' data.txt
Output:
A1 B1 C1 D1 E1 F1 A2 B2 C2 D2 E2 F2 A3 B3 C3 D3 E3 F3
In this command, -F,
sets the field separator as a comma for the initial column separation.
Then, the split
function is applied to each field ($1
, $2
, $3
), using -
as the delimiter.
Split Pipe-Separated Columns
For a pipe-separated dataset like:
A1|B1|C1 A2|B2|C2 A3|B3|C3
Set the field separator to a pipe:
awk -F'|' '{ split($1, a, "-"); split($2, b, "-"); print a[1], a[2], b[1], b[2], $3 }' data.txt
Output:
A1 B1 C1 A2 B2 C2 A3 B3 C3
Split Columns Based on Regular Expression
Consider a dataset like this:
A1X23B4 C5Y67D8 E9Z01F2
To split each line where a digit changes to a letter or vice versa, you can use awk
with a regex pattern:
awk '{ line = $0; new_line = ""; for (i = 1; i <= length(line); i++) { char = substr(line, i, 1); if (i > 1) { prev_char = substr(line, i-1, 1); if ((char ~ /[A-Za-z]/) != (prev_char ~ /[A-Za-z]/) || (char ~ /[0-9]/) != (prev_char ~ /[0-9]/)) { new_line = new_line " " char; } else { new_line = new_line char; } } else { new_line = new_line char; } } # Split the modified line based on spaces and print each element n = split(new_line, a, " "); for (i = 1; i <= n; i++) { printf("%s ", a[i]); } printf("\n"); }' data.txt
Output:
A 1 X 23 B 4 C 5 Y 67 D 8 E 9 Z 01 F 2
This command iterates over each character in the input line and checks if the transition from a digit to a letter or vice versa occurs by comparing the current character with the previous one.
If a transition occurs, it inserts a space before the current character.
Then it appends the current character (with a possible leading space) to the new line.
After processing all characters, it splits the new line based on spaces using the split
function and stores the results in the array a
.
The loop iterates over the array a
, printing each element followed by a space.
Conditional Splitting
Suppose you have a dataset like this:
A1,100 B2,200 C3,150 D4,300
Suppose you want to split the first column if the second column’s value is greater than 150, otherwise, print the whole line as it is.
Here’s how to do it using awk
:
awk -F, '{ if ($2 > 150) { split($1, arr, ""); print arr[1] "," arr[2] "," $2 } else { print $0 } }' data.txt
Output:
A1,100 B,2,200 C3,150 D,4,300
In this awk
script, -F,
sets the field separator to a comma.
split($1, arr, "")
still splits the first field ($1
) into an array arr
using an empty string as the separator.
print arr[1] "," arr[2] "," $2
prints both the first and second characters of the first column, followed by the second column’s value, separated by commas.
print $0
If the condition is not met, it prints the whole line as it is.
Split Columns Based on Multiple Field Separators
Imagine a dataset like this:
A1 B1,C1 D1-E1 A2 B2,C2 D2-E2 A3 B3,C3 D3-E3
Suppose you want to split the data using commas, spaces, and hyphens as field separators.
Use the following awk
command:
awk -F'[ ,\-]' '{ for (i = 1; i <= NF; i++) { printf("%s ", $i); } printf("\n"); }' data.txt
Output:
A1 B1 C1 D1 E1 A2 B2 C2 D2 E2 A3 B3 C3 D3 E3
The command -F'[ ,\-]'
sets the field separator to a regular expression that includes a space, comma, and hyphen.
The script then iterates through each field ($i
) in the line and prints each one followed by a space for clarity.
Rearrange Splitted Columns
Imagine a dataset like this:
A1,B1,C1 A2,B2,C2 A3,B3,C3
To split these columns and then rearrange them so that the third part becomes the first, the first becomes the second, and the second becomes the third, you can use awk
like this:
awk -F, '{ split($0, a, ","); print a[3], a[1], a[2] }' data.txt
Output:
C1 A1 B1 C2 A2 B2 C3 A3 B3
In this script, -F,
sets the field separator as a comma.
The split
function is used to divide the entire line ($0
) into an array a
.
Then, the print
statement rearranges and prints the elements of the array in the desired order: third, first, then second.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.