Remove Duplicates From CSV Files Using Linux awk

In this tutorial, you’ll learn how to remove duplicate entries from CSV files using Linux awk.

We’ll start with the basics of removing duplicate lines, then move on to more complex examples like removing duplicates based on a single or multiple columns.

 

 

Remove Duplicate Lines

Suppose you have a CSV file that includes columns like customer_id, name, email, and service_plan:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1002,Casey,casey@example.com,Plan B

Notice that the second and fourth lines are duplicates.

To remove these duplicates, you can use the following awk command:

awk -F, '!seen[$0]++' customer_data.csv

Output:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C

This command uses awk with the field separator -F, set to a comma, as CSV files are typically comma-separated. The !seen[$0]++ is an awk pattern-action statement.

It checks each line $0 against the array seen. If the line hasn’t been encountered before, it’s printed; otherwise, it’s skipped.

 

Remove Duplicates Based on a Single Column

Suppose you have the following CSV data:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1004,Devin,casey@example.com,Plan D

In this case, even though customers 1002 and 1004 have different names and service plans, they have the same email address.

To remove duplicates based on email column, you can use the following awk command:

awk -F, '!seen[$3]++' customer_data.csv

Output:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C

The !seen[$3]++ checks the third field $3 (the email column) against the seen array.

The entire line is printed if an email hasn’t been encountered before.

 

Remove Duplicates Based on Multiple Columns

Suppose you want to remove duplicates based on both email and service_plan.

Consider the following sample data:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1004,Devin,casey@example.com,Plan B
1005,Evan,casey@example.com,Plan C

To remove entries where both the email and service plan are duplicates, you can use the following awk command:

awk -F, '!seen[$3,$4]++' customer_data.csv

Output:

customer_id,name,email,service_plan
1001,Alex,alex@example.com,Plan A
1002,Casey,casey@example.com,Plan B
1003,Bailey,bailey@example.com,Plan C
1005,Evan,casey@example.com,Plan C

This command checks a combination of the third ($3) and fourth ($4) fields.

Only the first occurrence of a unique combination is printed.

Leave a Reply

Your email address will not be published. Required fields are marked *