Remove Duplicates From CSV Files Using Linux awk

In this tutorial, you’ll learn how to remove duplicate entries from CSV files using Linux awk.

We’ll start with the basics of removing duplicate lines, then move on to more complex examples like removing duplicates based on a single or multiple columns.

Table of Contents hide

1 Remove Duplicate Lines
2 Remove Duplicates Based on a Single Column
3 Remove Duplicates Based on Multiple Columns

Remove Duplicate Lines

Suppose you have a CSV file that includes columns like customer_id, name, email, and service_plan:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1002,Casey,casey@example.com,Plan B

Notice that the second and fourth lines are duplicates.

To remove these duplicates, you can use the following awk command:

awk -F, '!seen[$0]++' customer_data.csv

Output:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C

This command uses awk with the field separator -F, set to a comma, as CSV files are typically comma-separated. The !seen[$0]++ is an awk pattern-action statement.

It checks each line $0 against the array seen. If the line hasn’t been encountered before, it’s printed; otherwise, it’s skipped.

Remove Duplicates Based on a Single Column

Suppose you have the following CSV data:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1004,Devin,casey@example.com,Plan D

In this case, even though customers 1002 and 1004 have different names and service plans, they have the same email address.

To remove duplicates based on email column, you can use the following awk command:

awk -F, '!seen[$3]++' customer_data.csv

Output:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C

The !seen[$3]++ checks the third field $3 (the email column) against the seen array.

The entire line is printed if an email hasn’t been encountered before.

Remove Duplicates Based on Multiple Columns

Suppose you want to remove duplicates based on both email and service_plan.

Consider the following sample data:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1004,Devin,casey@example.com,Plan B
1005,Evan,casey@example.com,Plan C

To remove entries where both the email and service plan are duplicates, you can use the following awk command:

awk -F, '!seen[$3,$4]++' customer_data.csv

Output:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1005,Evan,casey@example.com,Plan C

This command checks a combination of the third ($3) and fourth ($4) fields.

Only the first occurrence of a unique combination is printed.

Mokhtar Ebrahim

Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.

Remove Duplicate Lines

Remove Duplicates Based on a Single Column

Remove Duplicates Based on Multiple Columns

Related posts

Leave a Reply Cancel reply