Avoid Duplicates When using Pandas to_sql()

In this tutorial, you’ll learn how to use Pandas to detect and remove duplicates, ensuring your datasets remain clean.

You’ll learn how to use Pandas duplicated() and drop_duplicates() to clean DataFrames before inserting data.



Identifying Duplicates

Pandas provides the DataFrame.duplicated() method to identify duplicate rows. Here’s how you can use it:

import pandas as pd
data = {
    'CustomerID': [1001, 1002, 1002, 1003, 1004],
    'Name': ['Chris', 'Pat', 'Pat', 'Alex', 'Jordan'],
    'Plan': ['Basic', 'Premium', 'Premium', 'Basic', 'Standard']
df = pd.DataFrame(data)
duplicates = df.duplicated()


0    False
1    False
2     True
3    False
4    False
dtype: bool

This output indicates that the third row (index 2) is a duplicate. The duplicated() method, by default, considers all columns and marks rows as duplicates if they are identical to a previous row.


Remove Duplicates

After identifying duplicates in your DataFrame, the next step is to remove them.

You can use DataFrame.drop_duplicates() method to remove duplicates.

Here’s how to use drop_duplicates() to remove duplicate rows from your DataFrame:

df_unique = df.drop_duplicates()


   CustomerID    Name      Plan
0        1001   Chris     Basic
1        1002     Pat   Premium
3        1003    Alex     Basic
4        1004  Jordan  Standard

In this output, you can see that the duplicate row (the third row in the original DataFrame) is removed, leaving you with a unique set of data.

The drop_duplicates() method keeps the first occurrence by default and removes subsequent duplicate entries.


Using to_sql after Dropping Duplicates

Once you have cleaned your DataFrame by dropping duplicates, the next crucial step is to insert this clean data into a SQL database.

Pandas to_sql method allows you to write records stored in a DataFrame to a SQL database.

For demonstration, let’s create a SQLite connection (you can replace this with your specific database connection details):

from sqlalchemy import create_engine
engine = create_engine('sqlite:///sample.db')

Now, use to_sql to insert the unique data:

df_custom_unique.to_sql('customer_data', con=engine, index=False, if_exists='replace')

This command inserts the data from df_custom_unique into the ‘customer_data’ table of the specified database.

The index=False parameter ensures that the DataFrame’s index is not written into the database.

The if_exists='replace' argument specifies that if the table already exists, it should be replaced. You can use if_exists='append' if you want to add to an existing table instead.


Handling Duplicates Based on Specific Columns

You can use subset parameter to identify duplicates based on specific columns:

duplicate_flags = df.duplicated(subset=['CustomerID', 'Plan'])


0    False
1    False
2     True
3    False
4    False
dtype: bool

This output highlights that the third row is a duplicate based on the ‘CustomerID’ and ‘Plan’ columns.

Next, drop these identified duplicates:

# Dropping duplicates based on the same subset of columns
df_no_duplicates = df.drop_duplicates(subset=['CustomerID', 'Plan'])


   CustomerID    Name      Plan
0        1001   Chris     Basic
1        1002     Pat   Premium
3        1003    Alex     Basic
4        1004  Jordan  Standard

The df_no_duplicates DataFrame now contains unique rows based on the ‘CustomerID’ and ‘Plan’ columns.

Finally, insert this cleaned data into your SQL database:

# Using to_sql to insert the data into a SQL database
df_no_duplicates.to_sql('customer_data', con=engine, index=False, if_exists='append')


Use Constraints to Avoid Inserting Duplicates

In addition to handling duplicates within Pandas, another effective method is to use database-level constraints to avoid inserting duplicates.

Setting Up Constraints in Your Database

Before inserting data, you should define unique constraints in your database schema. These constraints prevent the insertion of a new row if it violates the uniqueness of specified columns.

For example, in SQL, you might set a unique constraint on the ‘CustomerID’ and ‘Plan’ columns like this:

ALTER TABLE customer_data
ADD CONSTRAINT unique_customer_plan UNIQUE (CustomerID, Plan);

This SQL command alters the ‘customer_data’ table by adding a unique constraint named ‘unique_customer_plan’ on the ‘CustomerID’ and ‘Plan’ columns.

Inserting Data with Pandas

After setting up your database constraints, you can proceed to insert data using Pandas to_sql method.

In case of a constraint violation (i.e., attempting to insert a duplicate), the database will raise an error.

Here’s how you can handle this in Python:

    df.to_sql('customer_data', con=engine, index=False, if_exists='append')
    print("Data inserted successfully.")
except Exception as e:
    print(f"An error occurred: {e}")

This code attempts to insert the entire DataFrame into the ‘customer_data’ table.

If there are any duplicates based on the defined constraints, the database will throw an error, which is then caught and printed in the Python script.

Leave a Reply

Your email address will not be published. Required fields are marked *