Remove Duplicates From CSV Files Using Linux awk
In this tutorial, you’ll learn how to remove duplicate entries from CSV files using Linux awk
.
We’ll start with the basics of removing duplicate lines, then move on to more complex examples like removing duplicates based on a single or multiple columns.
Remove Duplicate Lines
Suppose you have a CSV file that includes columns like customer_id
, name
, email
, and service_plan
:
customer_id,name,email,service_plan 1001,Alex,alex@example.com,Plan A 1002,Casey,casey@example.com,Plan B 1003,Bailey,bailey@example.com,Plan C 1002,Casey,casey@example.com,Plan B
Notice that the second and fourth lines are duplicates.
To remove these duplicates, you can use the following awk
command:
awk -F, '!seen[$0]++' customer_data.csv
Output:
customer_id,name,email,service_plan 1001,Alex,alex@example.com,Plan A 1002,Casey,casey@example.com,Plan B 1003,Bailey,bailey@example.com,Plan C
This command uses awk
with the field separator -F,
set to a comma, as CSV files are typically comma-separated. The !seen[$0]++
is an awk
pattern-action statement.
It checks each line $0
against the array seen
. If the line hasn’t been encountered before, it’s printed; otherwise, it’s skipped.
Remove Duplicates Based on a Single Column
Suppose you have the following CSV data:
customer_id,name,email,service_plan 1001,Alex,alex@example.com,Plan A 1002,Casey,casey@example.com,Plan B 1003,Bailey,bailey@example.com,Plan C 1004,Devin,casey@example.com,Plan D
In this case, even though customers 1002 and 1004 have different names and service plans, they have the same email address.
To remove duplicates based on email column, you can use the following awk
command:
awk -F, '!seen[$3]++' customer_data.csv
Output:
customer_id,name,email,service_plan 1001,Alex,alex@example.com,Plan A 1002,Casey,casey@example.com,Plan B 1003,Bailey,bailey@example.com,Plan C
The !seen[$3]++
checks the third field $3
(the email column) against the seen
array.
The entire line is printed if an email hasn’t been encountered before.
Remove Duplicates Based on Multiple Columns
Suppose you want to remove duplicates based on both email
and service_plan
.
Consider the following sample data:
customer_id,name,email,service_plan 1001,Alex,alex@example.com,Plan A 1002,Casey,casey@example.com,Plan B 1003,Bailey,bailey@example.com,Plan C 1004,Devin,casey@example.com,Plan B 1005,Evan,casey@example.com,Plan C
To remove entries where both the email and service plan are duplicates, you can use the following awk
command:
awk -F, '!seen[$3,$4]++' customer_data.csv
Output:
customer_id,name,email,service_plan 1001,Alex,alex@example.com,Plan A 1002,Casey,casey@example.com,Plan B 1003,Bailey,bailey@example.com,Plan C 1005,Evan,casey@example.com,Plan C
This command checks a combination of the third ($3
) and fourth ($4
) fields.
Only the first occurrence of a unique combination is printed.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.