Export Python Pandas DataFrame to SQL using to_sql
The `to_sql` function allows you to write records stored in a DataFrame to a SQL database. Because it works through SQLAlchemy, it can write to any database that SQLAlchemy supports.
In the coming sections, we’ll dive deeper into this function and explore its capabilities.
- Basic Syntax and Usage
- Connecting to Various Databases
- Export DataFrame to SQL using Pandas to_sql
- Understanding the if_exists parameter
- Customizing Data Types and Conversions
- Index Management
- MultiIndex Support
- Managing Insertion (None, Multi, callable)
- Dealing with Large Data Sets
- Further Reading
Basic Syntax and Usage
The basic syntax of the `to_sql` function is as follows:

```python
DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True,
                 index_label=None, chunksize=None, dtype=None, method=None)
```
Let’s briefly discuss each parameter:
- name: The name of the SQL table that you’ll write your DataFrame to.
- con: The SQLAlchemy engine or a DBAPI2 connection. Using SQLAlchemy makes it possible to use any DB supported by that library.
- schema: The name of the SQL schema in the database to write to (optional).
- if_exists: What to do if the table already exists in the database. Options are ‘fail’ (the default), ‘replace’, and ‘append’.
- index: Whether to write the DataFrame’s index. Defaults to True.
- index_label: The column label(s) to use for the DataFrame’s index. If the DataFrame uses a MultiIndex, this should be a sequence. If None is given and the DataFrame uses a MultiIndex, the index names will be used.
- chunksize: Rows will be written in batches of this size. If None, all rows are written at once.
- dtype: A dictionary mapping column names to SQLAlchemy types (or strings for the SQLite3 legacy mode), used to override the data types that Pandas infers.
- method: Controls the SQL insertion clause that is used. New in version 0.24.0.
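Before drilling into each parameter, here is a minimal sketch of a typical call, assuming a throwaway in-memory SQLite engine and an illustrative table name ‘demo’:

```python
import pandas as pd
from sqlalchemy import create_engine

# A throwaway in-memory SQLite engine, just for illustration
engine = create_engine('sqlite:///:memory:')

df = pd.DataFrame({'a': [1, 2, 3]})

# Write df to a table named 'demo', replacing it if it already exists
# and skipping the DataFrame's index column
df.to_sql('demo', con=engine, if_exists='replace', index=False)
```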
Connecting to Various Databases
The `to_sql` function in Pandas requires either a SQLAlchemy engine or a sqlite3 database connection. There are two main methods for establishing these connections:
SQLite: SQLite is a software library that provides a relational database management system. It’s a self-contained, serverless, and zero-configuration database engine. To create a connection with SQLite, you can use the `sqlite3` module that comes with Python’s standard library:

```python
import sqlite3

conn = sqlite3.connect('my_database.db')
```
SQLAlchemy: SQLAlchemy provides a generalized interface for creating connections to different database systems, including MySQL, PostgreSQL, and Oracle.
For the purpose of this tutorial, we’ll use SQLAlchemy, since it can connect to any database the library supports, including SQLite.
SQLite
To establish a connection to an SQLite database using SQLAlchemy, you can do this:

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
```
In this example, we created an engine that uses an SQLite database stored in memory. This is a special case and is great for testing and development since you don’t have to worry about cleaning up a test database.
However, more commonly, you’ll want your SQLite database stored in a file on disk. In that case, simply provide a relative or absolute file path to the `create_engine` function instead of ‘:memory:’:

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///my_database.db')
```
Here, ‘my_database.db’ is the name of the SQLite database file. If the file doesn’t exist, SQLAlchemy will automatically create it when a connection is made.
MySQL
The most common driver for connecting to MySQL is `mysqlclient`, but others like `PyMySQL` are also available. Here’s how to establish a connection to a MySQL database:

```python
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqldb://username:password@localhost/db_name')
```
In this connection string, ‘username’ and ‘password’ are your MySQL username and password, ‘localhost’ is the host where your MySQL server is running, and ‘db_name’ is the name of the database you want to connect to.
If you’re using a different driver, replace ‘mysqldb’ with the appropriate name.
For example, if you’re using PyMySQL, your connection string would look like this:
```python
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://username:password@localhost/db_name')
```
Please note that you need to have the necessary MySQL driver installed to establish a connection. For `mysqlclient`, you can install it via pip:

```bash
pip install mysqlclient
```
And for PyMySQL:

```bash
pip install PyMySQL
```
PostgreSQL
You can use the `psycopg2` library as a driver for SQLAlchemy to connect to a PostgreSQL database.
Here’s how you can connect to a PostgreSQL database:

```python
from sqlalchemy import create_engine

engine = create_engine('postgresql://username:password@localhost/db_name')
```
In this connection string, ‘username’ and ‘password’ are your PostgreSQL username and password, ‘localhost’ is the host where your PostgreSQL server is running, and ‘db_name’ is the name of the database you want to connect to.
Note: You need to have the `psycopg2` library installed to establish the connection. You can install it using pip:

```bash
pip install psycopg2
```
If you have trouble installing `psycopg2`, try installing `psycopg2-binary` instead:

```bash
pip install psycopg2-binary
```
Oracle
You can use the `cx_Oracle` library to connect to an Oracle database. Here’s how to establish a connection to an Oracle database:

```python
from sqlalchemy import create_engine

engine = create_engine('oracle+cx_oracle://username:password@localhost:1521/sid')
```
In this connection string, ‘username’ and ‘password’ are your Oracle username and password, ‘localhost’ is the host where your Oracle server is running, ‘1521’ is the port where your Oracle server is listening, and ‘sid’ is the System ID (SID) of the Oracle database you want to connect to.
Please note that you need to have the `cx_Oracle` library installed to establish the connection. You can install it using pip:

```bash
pip install cx_Oracle
```
Remember, you also need to have the Oracle Client libraries installed on your machine. You can download them from the Oracle website.
Export DataFrame to SQL using Pandas to_sql
Now that we know how to connect to various databases using SQLAlchemy, let’s dive into how we can use the `to_sql` function to write data from a Pandas DataFrame to a SQL database.
Let’s start by creating a simple DataFrame:
```python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)
```
Output:
```
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
```
Now, let’s assume we have an SQLite database and we want to write this DataFrame to a table in the database. We’ll need to create a SQLAlchemy engine first:
```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
```
Then, we can use `to_sql` to write the DataFrame to the SQLite database:

```python
df.to_sql('People', con=engine, if_exists='replace', index=False)
```
In this example, we’re writing the DataFrame to a table named ‘People’. The `if_exists='replace'` argument tells Pandas to replace the table if it already exists, and `index=False` tells Pandas not to write the DataFrame’s index to the SQL table.
You can then verify that the data has been written to the SQLite table by using `read_sql`:

```python
result_df = pd.read_sql('People', con=engine)
print(result_df)
```
Output:
```
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
```
As you can see, we’ve successfully written the DataFrame to a SQL table and fetched the data back from the table.
Understanding the if_exists parameter
The `if_exists` parameter of the `to_sql` function controls what happens if the table already exists in the database. It accepts three possible values: ‘fail’, ‘replace’, and ‘append’.
Choosing the correct value for `if_exists` is very important, as each value leads to a different outcome. Consider carefully whether you want to replace the entire table, append new rows to it, or fail if the table already exists.
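Here is a minimal, self-contained sketch of the three behaviors, using a fresh in-memory SQLite engine and a small illustrative table:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')
df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# 'fail' only raises if the table exists; the first write simply creates it
df.to_sql('People', con=engine, if_exists='fail', index=False)

# 'append' adds the same two rows again, so the table now has 4 rows
df.to_sql('People', con=engine, if_exists='append', index=False)
print(len(pd.read_sql('People', con=engine)))  # 4

# 'replace' drops and recreates the table, so we're back to 2 rows
df.to_sql('People', con=engine, if_exists='replace', index=False)
print(len(pd.read_sql('People', con=engine)))  # 2

# A second write with if_exists='fail' would now raise a ValueError
```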
Customizing Data Types and Conversions
When writing a DataFrame to a SQL database using `to_sql`, Pandas automatically converts the DataFrame’s data types to SQL types.
However, there may be times when you want to specify custom SQL data types for certain columns. You can do this using the `dtype` argument of `to_sql`.
The `dtype` argument should be a dictionary where the keys are the column names and the values are SQLAlchemy types (or strings for the SQLite3 legacy mode).
For instance, let’s say we have a DataFrame with columns ‘Name’ (strings), ‘Age’ (integers), and ‘Salary’ (floats).
We can specify that ‘Age’ should be stored as an integer in the SQL database and ‘Salary’ as a float, like this:

```python
from sqlalchemy.types import String, Integer, Float

df.to_sql('Employees', con=engine, if_exists='replace', index=False,
          dtype={'Name': String, 'Age': Integer, 'Salary': Float})
```

In this example, ‘Name’ will be stored as a string, ‘Age’ as an integer, and ‘Salary’ as a float in the SQL table.
Note that when using `dtype`, Pandas will try to match each DataFrame column’s type with the given SQLAlchemy type if supported. If not, it will use the default SQL type for that column’s data type.
It’s also important to remember that not all SQL databases support all data types, so the `dtype` argument should be used with care.
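To check which SQL types were actually applied, you can use SQLAlchemy’s inspector. A small sketch, assuming the ‘Employees’ table written above and the same engine:

```python
from sqlalchemy import inspect

# Reflect the table that to_sql just created and print each column's SQL type
inspector = inspect(engine)
for column in inspector.get_columns('Employees'):
    print(column['name'], column['type'])
```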
Index Management
In Pandas, when a DataFrame is created, an index is generated automatically. The index can be based on one or more columns (a single index or a MultiIndex, respectively) and is especially useful when fetching or manipulating data.
In the `to_sql` function, we can control whether or not the DataFrame’s index is written to the SQL database by using the `index` parameter.
If `index=True` (which is the default), the index is written to the SQL database as a separate column. If `index=False`, the index is not written:

```python
df.to_sql('Employees', con=engine, if_exists='replace', index=False)
```
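Conversely, if you keep the index, the `index_label` parameter lets you name the column it is stored in. A minimal sketch (the column name ‘employee_id’ is just an illustrative choice):

```python
# Keep the index and store it in a column named 'employee_id'
df.to_sql('Employees', con=engine, if_exists='replace',
          index=True, index_label='employee_id')
```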
MultiIndex Support
MultiIndex or hierarchical index is a special type of index that allows you to have multiple levels of indices on a single axis.
If the DataFrame uses a MultiIndex, it can be written to the SQL database using the `to_sql` function, with each level of the index forming a separate column in the database.
Let’s see an example where we have a DataFrame with a MultiIndex:
```python
index = pd.MultiIndex.from_tuples([(1, 'red'), (1, 'blue'), (2, 'red'), (2, 'blue')],
                                  names=['number', 'color'])
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]}, index=index)
df.to_sql('People', con=engine, if_exists='replace')
```
In this case, the DataFrame uses a MultiIndex (‘number’ and ‘color’), and each level of the index is written as a separate column in the SQL database when using `to_sql`.
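To verify, read the table back; both index levels come back as ordinary columns. A quick sketch reusing the engine from above:

```python
result_df = pd.read_sql('People', con=engine)
print(result_df.columns.tolist())  # e.g. ['number', 'color', 'Name', 'Age']
```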
Managing Insertion (None, Multi, callable)
You can use the `method` parameter of the `to_sql` function to control how the data insertion into the SQL database is performed.
There are three main options: `None`, `'multi'`, or a callable.
Let’s start by creating a simple DataFrame:
```python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
```
None
When `method=None`, each row of the DataFrame is written to the SQL table individually. This is the default behavior:

```python
df.to_sql('People', con=engine, if_exists='replace', index=False, method=None)
```
Multi
When `method='multi'`, multiple rows are written per SQL statement. This can improve performance, especially for larger DataFrames:

```python
df.to_sql('People', con=engine, if_exists='replace', index=False, method='multi')
```
Callable (Custom Insertion)
For the callable method, you can define a custom function that controls the SQL insert statement. Pandas calls it with the pandas table object, a SQLAlchemy connection, the column names, and an iterator over the rows of data:

```python
def insert_with_replace(table, conn, keys, data_iter):
    # conn is a SQLAlchemy connection; get the underlying DBAPI connection
    dbapi_conn = conn.connection
    cursor = dbapi_conn.cursor()
    # SQLite syntax: replace a row if one with the same primary key exists
    insert_statement = "INSERT OR REPLACE INTO People VALUES (?, ?, ?)"
    cursor.executemany(insert_statement, data_iter)
    # No explicit commit needed: pandas commits the surrounding transaction

df.to_sql('People', con=engine, if_exists='replace', index=False,
          method=insert_with_replace)
```
In this example, we’ve defined a custom function that replaces the standard SQL INSERT statement with SQLite’s “INSERT OR REPLACE” statement.
This inserts new rows into the ‘People’ table, but if a row already exists with the same primary key, that row is replaced. Note that `to_sql` does not define a primary key when it creates a table, so for the replacement to take effect the table needs a primary key or unique constraint defined beforehand.
In all three methods, we can verify the data has been written by fetching it back from the table:
```python
result_df = pd.read_sql('People', con=engine)
print(result_df)
```
Output:
```
    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
```
Dealing with Large Data Sets
When you are dealing with a large DataFrame, writing the entire DataFrame to the SQL database all at once might not be feasible due to memory constraints. In such cases, pandas provides an option to write data in chunks.
You can use the `chunksize` parameter of the `to_sql` function to define the number of rows that should be written at a time.
Let’s consider a large DataFrame `df` with 100,000 rows.
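For a runnable sketch, such a DataFrame can be built with NumPy (the column names here are arbitrary):

```python
import numpy as np
import pandas as pd

# 100,000 rows of random data, just to have something large to write
df = pd.DataFrame({'value_a': np.random.rand(100_000),
                   'value_b': np.random.rand(100_000)})
```

We can write this data to a SQL database in chunks of 10,000 rows like this: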
```python
df.to_sql('LargeTable', con=engine, if_exists='replace', index=False, chunksize=10000)
```
In this example, pandas will insert 10,000 rows at a time into the ‘LargeTable’.
However, be aware that using `chunksize` can increase the total time it takes to write the entire DataFrame to the SQL database, as Pandas issues a separate SQL insert statement for each chunk.
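One way to mitigate this, as a sketch under the assumption that your database handles multi-row inserts well, is to combine `chunksize` with `method='multi'` so each chunk becomes a single multi-row INSERT statement rather than one statement per row:

```python
# Each 1,000-row chunk is sent as one multi-row INSERT statement
df.to_sql('LargeTable', con=engine, if_exists='replace', index=False,
          chunksize=1000, method='multi')
```

Keep the chunk size modest in this mode, since many databases cap the number of parameters allowed in a single statement.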
Further Reading
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html