Split Files Using Linux awk Command

In this tutorial, you’ll learn how to split large files into smaller ones using the awk command in Linux.

We’ll cover splitting by row count, by condition, and by pattern, as well as adding a custom header to each output file.

 

 

Split Based on Row Count

Let’s start by splitting a sample file data.txt into smaller files, each containing 1000 rows.

awk '{split(FILENAME, a, "."); prefix=a[1]; print > (prefix (int((NR-1)/1000) + 1) ".txt")}' data.txt

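This command splits the data.txt file into multiple output files of up to 1000 rows each; the last file holds whatever rows remain. The expression int((NR-1)/1000) + 1 maps rows 1 to 1000 to file 1, rows 1001 to 2000 to file 2, and so on.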

The output files will be named data1.txt, data2.txt, and so on, depending on the number of rows in the original file.
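When a file produces many chunks, awk can run into the limit on simultaneously open files, so it may help to close each chunk once it is full. The following is a minimal sketch of that idea, assuming a throwaway sample file generated with seq and a hypothetical variable named size for the chunk length; neither is part of the original command.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input (assumption for the demo): 2500 numbered rows.
seq 1 2500 > data.txt

# "size" is a hypothetical variable name holding the rows per chunk.
awk -v size=1000 '
FNR == 1 { split(FILENAME, a, "."); prefix = a[1] }   # derive "data" once per input file
(FNR - 1) % size == 0 {                               # first row of a new chunk
    if (out != "") close(out)                         # release the previous chunk
    out = prefix (int((FNR - 1) / size) + 1) ".txt"
}
{ print > out }
' data.txt

# Show the row count of each chunk; the last one holds the remainder.
wc -l data1.txt data2.txt data3.txt
```

With 2500 input rows this should leave two full chunks and a third with the remaining 500 rows. Passing the size with -v keeps the number in one place if you want a different chunk length.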

 

Conditional Splitting

Let’s say you want to split a file named data.txt into two separate files: one containing rows where the second column value is greater than 50 and another file containing rows where the second column value is less than or equal to 50.

awk '{ split(FILENAME, a, "."); if ($2 > 50) print > (a[1] "_greater_than_50.txt"); else print > (a[1] "_less_than_or_equal_to_50.txt") }' data.txt

This command splits the data.txt file into two separate files: data_greater_than_50.txt containing rows where the second column value is greater than 50, and data_less_than_or_equal_to_50.txt containing rows where the second column value is less than or equal to 50. Within a single awk run, > truncates a file only the first time it is opened, so both output files are rebuilt from scratch on every run instead of growing with duplicate rows.
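If the input starts with a header row, that row would be compared against 50 as well and end up in one of the output files. The sketch below shows one way to handle this, assuming a small made-up sample file with a "name score" header; the sample rows are illustrative only.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input (assumption for the demo): a header row plus four data rows.
printf '%s\n' 'name score' 'alice 72' 'bob 50' 'carol 18' 'dave 91' > data.txt

awk '
FNR == 1 { split(FILENAME, a, "."); next }   # assumed header row: remember the prefix, skip the line
{
    # Adding 0 forces a numeric comparison even if the field looks like text.
    out = a[1] (($2 + 0 > 50) ? "_greater_than_50.txt" : "_less_than_or_equal_to_50.txt")
    print > out
}
' data.txt

cat data_greater_than_50.txt
cat data_less_than_or_equal_to_50.txt
```

Here the rows for alice and dave should land in data_greater_than_50.txt, while bob (exactly 50) and carol go to the other file. Drop the next statement if your file has no header row.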

 

Split Based on Pattern

Let’s say you want to pull the lines that match a specific pattern out of a file named data.txt and into a file of their own.

The following awk command writes every line containing the word “pattern” to data_pattern.txt. Lines that do not match are not written anywhere.

awk '/pattern/ { split(FILENAME, a, "."); print > (a[1] "_pattern.txt") }' data.txt
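A related use of a pattern is to treat each matching line as the start of a new section and write every section to its own file. The sketch below illustrates that variant under stated assumptions: the sample data, the _section naming, and the counter n are all made up for the demo and are not part of the command above.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input (assumption for the demo): three sections, each opened by a marker line.
printf '%s\n' 'pattern alpha' 'a1' 'a2' 'pattern beta' 'b1' \
              'pattern gamma' 'c1' 'c2' 'c3' > data.txt

awk '
FNR == 1 { split(FILENAME, a, "."); n = 0 }
/pattern/ {                            # marker line: start the next section file
    if (out != "") close(out)
    out = a[1] "_section" (++n) ".txt"
}
out != "" { print > out }              # lines before the first marker are skipped
' data.txt

wc -l data_section1.txt data_section2.txt data_section3.txt
```

Each output file begins with its marker line, so the three files here should hold 3, 2, and 4 lines. Replace /pattern/ with a regular expression that fits your own section markers.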

 

Split File with Header

Let’s say you have a file named data.txt that you want to split into smaller files, each with a header indicating the content type.

For example, let’s split the file into smaller files with a header indicating “Data Set” followed by the rows.

awk 'BEGIN {header="Data Set"} { split(FILENAME, a, "."); if (NR%1000 == 1) {filename = a[1] "_" (int(NR/1000) + 1) ".txt"; print header > filename} print > filename }' data.txt

This command splits the data.txt file into multiple output files of up to 1000 rows each, and every output file begins with a header line reading “Data Set”.

The output files will be named accordingly, such as data_1.txt, data_2.txt, and so on.
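Often the header you want is the first line of the input itself, for example the column names of a CSV file. The following is a minimal sketch of that case, assuming a generated sample file with an "id,value" header and a hypothetical variable named size for the rows per chunk.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input (assumption for the demo): one header line plus 2500 data rows.
{ echo "id,value"; seq 1 2500 | awk '{ print $1 "," $1 * 2 }'; } > data.txt

# "size" is a hypothetical variable name holding the data rows per chunk.
awk -v size=1000 '
FNR == 1 { split(FILENAME, a, "."); header = $0; next }   # keep the first line as the header
(FNR - 2) % size == 0 {                                   # first data row of a new chunk
    if (out != "") close(out)
    out = a[1] "_" (int((FNR - 2) / size) + 1) ".txt"
    print header > out                                    # every chunk starts with the header
}
{ print > out }
' data.txt

head -n 2 data_2.txt
```

Each chunk then carries the original column names, so it can be opened on its own. The chunk files contain one line more than size because of the repeated header.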
