Split Files Using Linux awk Command
In this tutorial, you’ll learn how to split large files into smaller ones using the awk command in Linux.
We’ll cover how to divide files based on row count, specific conditions, patterns, or while adding custom headers.
Split Based on Row Count
Let’s start by splitting a sample file, data.txt, into smaller files of up to 1000 rows each.
awk 'FNR == 1 { split(FILENAME, a, "."); prefix = a[1] } { print > (prefix (int((NR-1)/1000) + 1) ".txt") }' data.txt
This command splits data.txt into multiple output files. Each file holds 1000 rows, except the last, which holds whatever remains.
The output files are named data1.txt, data2.txt, and so on, depending on the number of rows in the original file.
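To try the row-count split on a small scale, the sketch below builds a 2500-line sample file with seq and splits it. It is an illustration under assumptions rather than the command above: the file name sample.txt and the chunk variable are made up for this example, and close() is added so that only one output file is open at a time, which can matter in awk implementations that limit the number of open files.

```shell
# Build a 2500-line sample file (sample.txt is a hypothetical name).
seq 1 2500 > sample.txt

# Start a new output file every `chunk` lines and close the previous one.
awk -v chunk=1000 '
    FNR == 1 { split(FILENAME, a, "."); prefix = a[1] }
    (NR - 1) % chunk == 0 {
        if (out != "") close(out)
        out = prefix (int((NR - 1) / chunk) + 1) ".txt"
    }
    { print > out }
' sample.txt

# Show how many lines each piece received.
wc -l sample1.txt sample2.txt sample3.txt
```

With this input, sample1.txt and sample2.txt receive 1000 lines each and sample3.txt receives the remaining 500. Run it in an empty scratch directory, since it creates files in the current one.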
Conditional Splitting
Let’s say you want to split a file named data.txt into two separate files: one for rows where the second column value is greater than 50, and another for rows where it is less than or equal to 50.
awk 'FNR == 1 { split(FILENAME, a, ".") } { if ($2 > 50) print > (a[1] "_greater_than_50.txt"); else print > (a[1] "_less_than_or_equal_to_50.txt") }' data.txt
This command writes each row of data.txt to data_greater_than_50.txt when the second column value is greater than 50, and to data_less_than_or_equal_to_50.txt otherwise. Within a single awk run, > truncates a file only the first time it is opened, so rerunning the command replaces the earlier output instead of appending to it.
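The threshold does not have to be hard-coded. The sketch below passes it in with -v and builds the output name from it. The file scores.txt, its contents, and the variable names limit, base, and suffix are assumptions made for this example only.

```shell
# Create a small two-column sample (scores.txt is a hypothetical name).
printf '%s\n' 'alice 72' 'bob 50' 'carol 18' 'dave 91' > scores.txt

# Pass the threshold in with -v and pick the output file per row.
awk -v limit=50 '
    FNR == 1 { split(FILENAME, a, "."); base = a[1] }
    {
        suffix = ($2 > limit) ? "_greater_than_" : "_less_than_or_equal_to_"
        out = base suffix limit ".txt"
        print > out
    }
' scores.txt

cat scores_greater_than_50.txt
cat scores_less_than_or_equal_to_50.txt
```

Here the rows for alice and dave land in scores_greater_than_50.txt, while bob (exactly 50) and carol land in scores_less_than_or_equal_to_50.txt.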
Split Based on Pattern
Let’s say you want to pull the lines of a file named data.txt that match a specific pattern into their own file.
To write every line containing the word “pattern” to a separate file, you can use the following awk command:
awk '/pattern/ { split(FILENAME, a, "."); print > (a[1] "_pattern.txt") }' data.txt
Matching lines are written to data_pattern.txt. Lines that do not match are not written anywhere.
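If the goal is instead to cut a file into several pieces wherever a marker line appears, one possible sketch looks like this. It assumes a hypothetical log.txt in which every section begins with a line starting with START. The file name, the marker, and the _partN naming are all made up for illustration.

```shell
# Sample input with three sections (log.txt and the START marker are made up).
printf '%s\n' 'START a' 'one' 'two' 'START b' 'three' 'START c' 'four' 'five' > log.txt

# Open a new numbered file at every marker line.
# Lines that appear before the first marker are skipped.
awk '
    FNR == 1 { split(FILENAME, a, "."); base = a[1] }
    /^START/ {
        if (out != "") close(out)
        out = base "_part" (++n) ".txt"
    }
    out != "" { print > out }
' log.txt

ls log_part*.txt
```

Each marker line becomes the first line of its own file, so the sample produces log_part1.txt, log_part2.txt, and log_part3.txt.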
Split File with Header
Let’s say you have a file named data.txt that you want to split into smaller files, each starting with a header line that indicates the content type.
For example, let’s split the file into smaller files that each begin with the header “Data Set”, followed by the rows.
awk 'BEGIN { header = "Data Set" } FNR == 1 { split(FILENAME, a, ".") } NR % 1000 == 1 { if (filename != "") close(filename); filename = a[1] "_" (int(NR/1000) + 1) ".txt"; print header > filename } { print > filename }' data.txt
This command splits data.txt into multiple output files, each starting with the “Data Set” header followed by up to 1000 rows.
The output files are named data_1.txt, data_2.txt, and so on.
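A common variation is to reuse the input file's own first line as the header of every piece, for example when splitting a CSV file. The sketch below is one way to do that under stated assumptions: report.csv, its id,value columns, and the chunk variable are invented for this example.

```shell
# Sample CSV: a header row followed by 2500 data rows (report.csv is a made-up name).
{ echo 'id,value'; seq 1 2500 | awk '{ print $1 "," $1 * 2 }'; } > report.csv

# Remember line 1 as the header, then start a new file every `chunk` data rows.
awk -v chunk=1000 '
    FNR == 1 { header = $0; split(FILENAME, a, "."); base = a[1]; next }
    (FNR - 2) % chunk == 0 {
        if (out != "") close(out)
        out = base "_" (++n) ".csv"
        print header > out
    }
    { print > out }
' report.csv

wc -l report_1.csv report_2.csv report_3.csv
```

Each output file starts with id,value, so report_1.csv and report_2.csv contain 1001 lines and report_3.csv contains 501.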
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.