Avoid Duplicates When Using Pandas to_sql()
In this tutorial, you’ll learn how to use Pandas to detect and remove duplicates, ensuring your datasets remain clean. You’ll learn how to use the Pandas duplicated() and drop_duplicates() methods to clean DataFrames before inserting data.
Identifying Duplicates
Pandas provides the DataFrame.duplicated() method to identify duplicate rows. Here’s how you can use it:

```python
import pandas as pd

data = {
    'CustomerID': [1001, 1002, 1002, 1003, 1004],
    'Name': ['Chris', 'Pat', 'Pat', 'Alex', 'Jordan'],
    'Plan': ['Basic', 'Premium', 'Premium', 'Basic', 'Standard']
}
df = pd.DataFrame(data)

duplicates = df.duplicated()
print(duplicates)
```
Output:
```
0    False
1    False
2     True
3    False
4    False
dtype: bool
```
This output indicates that the third row (index 2) is a duplicate. By default, the duplicated() method considers all columns and marks a row as a duplicate if it is identical to an earlier row.
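The keep parameter controls which occurrence gets flagged. A short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1001, 1002, 1002, 1003, 1004],
    'Name': ['Chris', 'Pat', 'Pat', 'Alex', 'Jordan'],
    'Plan': ['Basic', 'Premium', 'Premium', 'Basic', 'Standard']
})

# keep='first' (the default): the first occurrence is NOT flagged
print(df.duplicated(keep='first').tolist())  # [False, False, True, False, False]

# keep='last': the last occurrence is NOT flagged
print(df.duplicated(keep='last').tolist())   # [False, True, False, False, False]

# keep=False: flag every row that has a duplicate anywhere
print(df.duplicated(keep=False).tolist())    # [False, True, True, False, False]
```

keep=False is handy when you want to inspect all copies of a duplicated record before deciding which one to keep.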
Remove Duplicates
After identifying duplicates in your DataFrame, the next step is to remove them. You can use the DataFrame.drop_duplicates() method to do this.

Here’s how to use drop_duplicates() to remove duplicate rows from your DataFrame:

```python
df_unique = df.drop_duplicates()
print(df_unique)
```
Output:
```
   CustomerID    Name      Plan
0        1001   Chris     Basic
1        1002     Pat   Premium
3        1003    Alex     Basic
4        1004  Jordan  Standard
```
In this output, you can see that the duplicate row (the third row of the original DataFrame) has been removed, leaving you with a unique set of data. By default, the drop_duplicates() method keeps the first occurrence and removes subsequent duplicate entries.
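If you would rather keep the last occurrence, or renumber the surviving rows, drop_duplicates() supports that too. A quick sketch (ignore_index requires pandas 1.0 or later):

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1001, 1002, 1002, 1003],
    'Name': ['Chris', 'Pat', 'Pat', 'Alex'],
    'Plan': ['Basic', 'Premium', 'Premium', 'Basic']
})

# keep='last' retains the final occurrence instead of the first
df_last = df.drop_duplicates(keep='last')
print(df_last.index.tolist())   # [0, 2, 3]

# ignore_index=True renumbers the remaining rows from 0
df_reset = df.drop_duplicates(ignore_index=True)
print(df_reset.index.tolist())  # [0, 1, 2]
```

Renumbering matters little here because the index is not written to the database later, but it keeps printed output tidy.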
Using to_sql after Dropping Duplicates
Once you have cleaned your DataFrame by dropping duplicates, the next crucial step is to insert this clean data into a SQL database.
The Pandas to_sql() method allows you to write the records stored in a DataFrame to a SQL database.

For demonstration, let’s create a SQLite connection (you can replace this with your specific database connection details):

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///sample.db')
```
Now, use to_sql() to insert the unique data:

```python
df_unique.to_sql('customer_data', con=engine, index=False, if_exists='replace')
```

This command inserts the data from df_unique into the ‘customer_data’ table of the specified database.
The index=False parameter ensures that the DataFrame’s index is not written to the database. The if_exists='replace' argument specifies that if the table already exists, it should be replaced; use if_exists='append' if you want to add rows to an existing table instead.
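It is worth verifying what actually landed in the table. A minimal end-to-end sketch using an in-memory SQLite database (so it leaves no file behind) and pd.read_sql to read the table back:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite keeps the sketch self-contained
engine = create_engine('sqlite:///:memory:')

df_unique = pd.DataFrame({
    'CustomerID': [1001, 1002, 1003, 1004],
    'Name': ['Chris', 'Pat', 'Alex', 'Jordan'],
    'Plan': ['Basic', 'Premium', 'Basic', 'Standard']
})

df_unique.to_sql('customer_data', con=engine, index=False, if_exists='replace')

# Read the table back to confirm what was written
result = pd.read_sql('SELECT * FROM customer_data', con=engine)
print(len(result))  # 4
```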
Handling Duplicates Based on Specific Columns
You can use the subset parameter to identify duplicates based on specific columns:

```python
duplicate_flags = df.duplicated(subset=['CustomerID', 'Plan'])
print(duplicate_flags)
```
Output:
```
0    False
1    False
2     True
3    False
4    False
dtype: bool
```
This output highlights that the third row is a duplicate based on the ‘CustomerID’ and ‘Plan’ columns.
Next, drop these identified duplicates:
```python
# Drop duplicates based on the same subset of columns
df_no_duplicates = df.drop_duplicates(subset=['CustomerID', 'Plan'])
print(df_no_duplicates)
```
Output:
```
   CustomerID    Name      Plan
0        1001   Chris     Basic
1        1002     Pat   Premium
3        1003    Alex     Basic
4        1004  Jordan  Standard
```
The df_no_duplicates DataFrame now contains rows that are unique with respect to the ‘CustomerID’ and ‘Plan’ columns.
Finally, insert this cleaned data into your SQL database:
```python
# Use to_sql to insert the cleaned data into the SQL database
df_no_duplicates.to_sql('customer_data', con=engine, index=False, if_exists='append')
```
Use Constraints to Avoid Inserting Duplicates
In addition to handling duplicates within Pandas, another effective method is to use database-level constraints to avoid inserting duplicates.
Setting Up Constraints in Your Database
Before inserting data, you should define unique constraints in your database schema. These constraints prevent the insertion of a new row if it violates the uniqueness of specified columns.
For example, in SQL you might set a unique constraint on the ‘CustomerID’ and ‘Plan’ columns like this:

```sql
ALTER TABLE customer_data
ADD CONSTRAINT unique_customer_plan UNIQUE (CustomerID, Plan);
```

This SQL command alters the ‘customer_data’ table by adding a unique constraint named ‘unique_customer_plan’ on the ‘CustomerID’ and ‘Plan’ columns. Note that SQLite does not support adding constraints with ALTER TABLE; on SQLite, create a unique index instead (CREATE UNIQUE INDEX unique_customer_plan ON customer_data (CustomerID, Plan);), which enforces the same uniqueness rule.
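Since the demonstration database here is SQLite, a sketch of enforcing uniqueness from Python via SQLAlchemy, using a unique index (SQLite’s equivalent of the constraint above); the index name is the same illustrative one used earlier:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite keeps the sketch self-contained
engine = create_engine('sqlite:///:memory:')

pd.DataFrame({'CustomerID': [1001], 'Plan': ['Basic']}).to_sql(
    'customer_data', con=engine, index=False, if_exists='replace')

# SQLite lacks ALTER TABLE ... ADD CONSTRAINT, so a unique index
# enforces the same rule
with engine.begin() as conn:
    conn.execute(text(
        'CREATE UNIQUE INDEX unique_customer_plan '
        'ON customer_data (CustomerID, Plan)'
    ))

# A duplicate insert is now rejected by the database
try:
    with engine.begin() as conn:
        conn.execute(text(
            "INSERT INTO customer_data (CustomerID, Plan) VALUES (1001, 'Basic')"
        ))
except Exception as e:
    print(f"Duplicate rejected: {e}")
```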
Inserting Data with Pandas
After setting up your database constraints, you can proceed to insert data using the Pandas to_sql() method.
In case of a constraint violation (i.e., attempting to insert a duplicate), the database will raise an error.
Here’s how you can handle this in Python:
```python
try:
    df.to_sql('customer_data', con=engine, index=False, if_exists='append')
    print("Data inserted successfully.")
except Exception as e:
    print(f"An error occurred: {e}")
```
This code attempts to insert the entire DataFrame into the ‘customer_data’ table.
If there are any duplicates based on the defined constraints, the database will throw an error, which is then caught and printed in the Python script.
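Catching the error rejects the whole batch, though. A common alternative is to filter the incoming DataFrame against the keys already in the table and append only the genuinely new rows. A sketch, with illustrative data (the anti-join via merge with indicator=True is one way to do this, not the only one):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

# Rows already stored in the table
existing = pd.DataFrame({'CustomerID': [1001, 1002],
                         'Plan': ['Basic', 'Premium']})
existing.to_sql('customer_data', con=engine, index=False, if_exists='replace')

# New batch: one row is already present, one is new
incoming = pd.DataFrame({'CustomerID': [1002, 1003],
                         'Plan': ['Premium', 'Basic']})

# Read the current keys and keep only incoming rows not already present
current = pd.read_sql('SELECT CustomerID, Plan FROM customer_data', con=engine)
merged = incoming.merge(current, on=['CustomerID', 'Plan'],
                        how='left', indicator=True)
new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

new_rows.to_sql('customer_data', con=engine, index=False, if_exists='append')
print(len(pd.read_sql('SELECT * FROM customer_data', con=engine)))  # 3
```

For large tables, reading every existing key back can get expensive; in that case, database-side upserts (for example, SQLite’s INSERT OR IGNORE) scale better than filtering in Pandas.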
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.