Avoid Duplicates When Using Pandas to_sql()
In this tutorial, you’ll learn how to use Pandas to detect and remove duplicates, ensuring your datasets remain clean. You’ll learn how to use the Pandas duplicated() and drop_duplicates() methods to clean DataFrames before inserting data.
Identifying Duplicates
Pandas provides the DataFrame.duplicated() method to identify duplicate rows. Here’s how you can use it:

```python
import pandas as pd

data = {
    'CustomerID': [1001, 1002, 1002, 1003, 1004],
    'Name': ['Chris', 'Pat', 'Pat', 'Alex', 'Jordan'],
    'Plan': ['Basic', 'Premium', 'Premium', 'Basic', 'Standard']
}
df = pd.DataFrame(data)

duplicates = df.duplicated()
print(duplicates)
```
Output:
```
0    False
1    False
2     True
3    False
4    False
dtype: bool
```
This output indicates that the third row (index 2) is a duplicate. By default, the duplicated() method considers all columns and marks a row as a duplicate if it is identical to an earlier row.
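The keep parameter controls which occurrence gets flagged. A short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1001, 1002, 1002, 1003, 1004],
    'Name': ['Chris', 'Pat', 'Pat', 'Alex', 'Jordan'],
    'Plan': ['Basic', 'Premium', 'Premium', 'Basic', 'Standard']
})

# keep='first' (the default): the first occurrence is NOT flagged
print(df.duplicated(keep='first').tolist())  # [False, False, True, False, False]

# keep='last': the last occurrence is NOT flagged
print(df.duplicated(keep='last').tolist())   # [False, True, False, False, False]

# keep=False: flag every row that has a duplicate anywhere
print(df.duplicated(keep=False).tolist())    # [False, True, True, False, False]
```

keep=False is handy when you want to inspect all copies of a duplicated record before deciding which one to keep.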
Remove Duplicates
After identifying duplicates in your DataFrame, the next step is to remove them. You can use the DataFrame.drop_duplicates() method to do this.

Here’s how to use drop_duplicates() to remove duplicate rows from your DataFrame:

```python
df_unique = df.drop_duplicates()
print(df_unique)
```
Output:
```
   CustomerID    Name      Plan
0        1001   Chris     Basic
1        1002     Pat   Premium
3        1003    Alex     Basic
4        1004  Jordan  Standard
```
In this output, you can see that the duplicate row (the third row of the original DataFrame) has been removed, leaving you with a unique set of data. By default, the drop_duplicates() method keeps the first occurrence and removes subsequent duplicate entries.
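If you would rather keep the last occurrence, or renumber the surviving rows, drop_duplicates() supports that too. A quick sketch (ignore_index requires pandas 1.0 or later):

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1001, 1002, 1002, 1003],
    'Name': ['Chris', 'Pat', 'Pat', 'Alex'],
    'Plan': ['Basic', 'Premium', 'Premium', 'Basic']
})

# keep='last' retains the final occurrence instead of the first
df_last = df.drop_duplicates(keep='last')
print(df_last.index.tolist())   # [0, 2, 3]

# ignore_index=True renumbers the remaining rows from 0
df_reset = df.drop_duplicates(ignore_index=True)
print(df_reset.index.tolist())  # [0, 1, 2]
```

Renumbering matters little here because the index is not written to the database later, but it keeps printed output tidy.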
Using to_sql after Dropping Duplicates
Once you have cleaned your DataFrame by dropping duplicates, the next crucial step is to insert this clean data into a SQL database.
The Pandas to_sql() method allows you to write the records stored in a DataFrame to a SQL database.

For demonstration, let’s create a SQLite connection (you can replace this with your specific database connection details):

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///sample.db')
```
Now, use to_sql() to insert the unique data:

```python
df_unique.to_sql('customer_data', con=engine, index=False, if_exists='replace')
```

This command inserts the data from df_unique into the ‘customer_data’ table of the specified database.
The index=False parameter ensures that the DataFrame’s index is not written to the database. The if_exists='replace' argument specifies that if the table already exists, it should be replaced; use if_exists='append' if you want to add rows to an existing table instead.
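It is worth verifying what actually landed in the table. A minimal end-to-end sketch using an in-memory SQLite database (so it leaves no file behind) and pd.read_sql to read the table back:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite keeps the sketch self-contained
engine = create_engine('sqlite:///:memory:')

df_unique = pd.DataFrame({
    'CustomerID': [1001, 1002, 1003, 1004],
    'Name': ['Chris', 'Pat', 'Alex', 'Jordan'],
    'Plan': ['Basic', 'Premium', 'Basic', 'Standard']
})

df_unique.to_sql('customer_data', con=engine, index=False, if_exists='replace')

# Read the table back to confirm what was written
result = pd.read_sql('SELECT * FROM customer_data', con=engine)
print(len(result))  # 4
```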
Handling Duplicates Based on Specific Columns
You can use the subset parameter to identify duplicates based on specific columns:

```python
duplicate_flags = df.duplicated(subset=['CustomerID', 'Plan'])
print(duplicate_flags)
```
Output:
```
0    False
1    False
2     True
3    False
4    False
dtype: bool
```
This output highlights that the third row is a duplicate based on the ‘CustomerID’ and ‘Plan’ columns.
Next, drop these identified duplicates:
```python
# Drop duplicates based on the same subset of columns
df_no_duplicates = df.drop_duplicates(subset=['CustomerID', 'Plan'])
print(df_no_duplicates)
```
Output:
```
   CustomerID    Name      Plan
0        1001   Chris     Basic
1        1002     Pat   Premium
3        1003    Alex     Basic
4        1004  Jordan  Standard
```
The df_no_duplicates DataFrame now contains rows that are unique with respect to the ‘CustomerID’ and ‘Plan’ columns.
Finally, insert this cleaned data into your SQL database:
```python
# Use to_sql to insert the cleaned data into the SQL database
df_no_duplicates.to_sql('customer_data', con=engine, index=False, if_exists='append')
```
Use Constraints to Avoid Inserting Duplicates
In addition to handling duplicates within Pandas, another effective method is to use database-level constraints to avoid inserting duplicates.
Setting Up Constraints in Your Database
Before inserting data, you should define unique constraints in your database schema. These constraints prevent the insertion of a new row if it violates the uniqueness of specified columns.
For example, in SQL you might set a unique constraint on the ‘CustomerID’ and ‘Plan’ columns like this:

```sql
ALTER TABLE customer_data
ADD CONSTRAINT unique_customer_plan UNIQUE (CustomerID, Plan);
```

This SQL command alters the ‘customer_data’ table by adding a unique constraint named ‘unique_customer_plan’ on the ‘CustomerID’ and ‘Plan’ columns. Note that SQLite does not support adding constraints with ALTER TABLE; on SQLite, create a unique index instead (CREATE UNIQUE INDEX unique_customer_plan ON customer_data (CustomerID, Plan);), which enforces the same uniqueness rule.
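Since the demonstration database here is SQLite, a sketch of enforcing uniqueness from Python via SQLAlchemy, using a unique index (SQLite’s equivalent of the constraint above); the index name is the same illustrative one used earlier:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite keeps the sketch self-contained
engine = create_engine('sqlite:///:memory:')

pd.DataFrame({'CustomerID': [1001], 'Plan': ['Basic']}).to_sql(
    'customer_data', con=engine, index=False, if_exists='replace')

# SQLite lacks ALTER TABLE ... ADD CONSTRAINT, so a unique index
# enforces the same rule
with engine.begin() as conn:
    conn.execute(text(
        'CREATE UNIQUE INDEX unique_customer_plan '
        'ON customer_data (CustomerID, Plan)'
    ))

# A duplicate insert is now rejected by the database
try:
    with engine.begin() as conn:
        conn.execute(text(
            "INSERT INTO customer_data (CustomerID, Plan) VALUES (1001, 'Basic')"
        ))
except Exception as e:
    print(f"Duplicate rejected: {e}")
```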
Inserting Data with Pandas
After setting up your database constraints, you can proceed to insert data using the Pandas to_sql() method.
In case of a constraint violation (i.e., attempting to insert a duplicate), the database will raise an error.
Here’s how you can handle this in Python:
```python
try:
    df.to_sql('customer_data', con=engine, index=False, if_exists='append')
    print("Data inserted successfully.")
except Exception as e:
    print(f"An error occurred: {e}")
```
This code attempts to insert the entire DataFrame into the ‘customer_data’ table.
If there are any duplicates based on the defined constraints, the database will throw an error, which is then caught and printed in the Python script.
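Catching the error rejects the whole batch, though. A common alternative is to filter the incoming DataFrame against the keys already in the table and append only the genuinely new rows. A sketch, with illustrative data (the anti-join via merge with indicator=True is one way to do this, not the only one):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

# Rows already stored in the table
existing = pd.DataFrame({'CustomerID': [1001, 1002],
                         'Plan': ['Basic', 'Premium']})
existing.to_sql('customer_data', con=engine, index=False, if_exists='replace')

# New batch: one row is already present, one is new
incoming = pd.DataFrame({'CustomerID': [1002, 1003],
                         'Plan': ['Premium', 'Basic']})

# Read the current keys and keep only incoming rows not already present
current = pd.read_sql('SELECT CustomerID, Plan FROM customer_data', con=engine)
merged = incoming.merge(current, on=['CustomerID', 'Plan'],
                        how='left', indicator=True)
new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

new_rows.to_sql('customer_data', con=engine, index=False, if_exists='append')
print(len(pd.read_sql('SELECT * FROM customer_data', con=engine)))  # 3
```

For large tables, reading every existing key back can get expensive; in that case, database-side upserts (for example, SQLite’s INSERT OR IGNORE) scale better than filtering in Pandas.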
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.