Read SQL Query/Table into DataFrame using Pandas read_sql
The read_sql function allows you to load data from a SQL database directly into a Pandas DataFrame. You can execute SQL queries directly or read an entire table into a DataFrame. By using pandas.read_sql, you create a seamless bridge between your SQL database and Pandas.
- Syntax of Pandas read_sql function
- Establish a Connection
- Basic SQL Database Queries
- Reading Table Data
- Filter rows
- Extracting Specific Columns
- Setting the DataFrame Index
- Parameterized Queries
- Parsing Date Columns
- Handling Large Datasets with chunks
- Executing Stored Procedures
Syntax of Pandas read_sql function
The basic syntax of the read_sql function in Pandas is as follows:
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
Here’s what each parameter signifies:
- sql: A string SQL query or a SQLAlchemy Selectable (a compiled SQL statement) to be executed.
- con: Your database connection object, which can be a SQLAlchemy engine or a sqlite3 connection.
- index_col: An optional column (or list of columns) to set as the DataFrame index.
- coerce_float: When set to True (the default), attempts to convert non-string, non-numeric values to floating point.
- params: An optional list or dictionary of parameters to pass into the sql query.
- parse_dates: An optional list of columns to parse as datetime.
- columns: If you're reading a table (not a query), selects which columns to load.
- chunksize: If specified, read_sql returns an iterator where chunksize is the number of rows per chunk.
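To see how several of these parameters work together, here is a minimal sketch (the students table, its dob and grade columns, and the database file are hypothetical; establishing the connection is covered in the next section):

import sqlite3
import pandas as pd

con = sqlite3.connect('school_db.sqlite')

# Hypothetical example combining several parameters: filter with a
# bound parameter, set 'id' as the index, and parse 'dob' as datetime.
df = pd.read_sql(
    "SELECT * FROM students WHERE grade > ?",
    con,
    params=[10],
    index_col='id',
    parse_dates=['dob'],
)
con.close()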
Establish a Connection
With the Pandas read_sql function, you can establish a connection to a SQL database in several ways. The two most common are:
Using sqlite3 Connection
If you’re working with SQLite databases, you can also use a sqlite3 connection directly. Here’s an example:
import sqlite3
import pandas as pd

con = sqlite3.connect('school_db.sqlite')
Using SQLAlchemy Engine
You can create a SQLAlchemy engine and use it as the connection argument of the read_sql function. This method has the advantage of working with many different types of SQL databases.
First, you need to install SQLAlchemy:
pip install sqlalchemy
Here’s an example to connect to a SQLite database:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///school_db.sqlite')
MySQL
To connect to a MySQL database, you need to install the mysql-connector-python package, which is the MySQL connector for Python that SQLAlchemy will use to communicate with the MySQL server:
pip install mysql-connector-python
Once you have the connector installed, you can create an engine for your MySQL database:
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://USERNAME:PASSWORD@HOST/DB_NAME')
PostgreSQL
For PostgreSQL, you’ll want to use the psycopg2-binary package, which is the most popular PostgreSQL adapter for Python.
pip install psycopg2-binary
Once the adapter is installed, you can create an engine for your PostgreSQL database:
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://USERNAME:PASSWORD@HOST/DB_NAME')
Oracle
To interact with an Oracle database using Python, we can utilize the cx_Oracle library.
First, you’ll need to install the cx_Oracle package with pip:
pip install cx_Oracle
After installing cx_Oracle, you can establish a connection to an Oracle database:
from sqlalchemy import create_engine

engine = create_engine('oracle+cx_oracle://USERNAME:PASSWORD@HOST:PORT/DB_NAME')
Basic SQL Database Queries
First, you need to import the necessary libraries and establish a connection to the database:
import pandas as pd
import sqlite3

con = sqlite3.connect("school_database.db")
Now, let's assume there is a table called 'users' in the database. We can use read_sql to load the entire 'users' table into a DataFrame:
df = pd.read_sql("SELECT * FROM users", con)
con.close()
print(df.head())
Output:
   id     name  age
0   1    Alice   20
1   2      Bob   22
2   3  Charlie   19
3   4    David   21
4   5     Emma   20
In the code above, we executed the SQL query "SELECT * FROM users", which selects all columns and rows from the 'users' table. The read_sql function executed this query and loaded the result into a Pandas DataFrame, df.
Don't forget to close the connection after reading the data by calling con.close(), so it doesn't keep consuming resources.
The df.head() call then prints the first 5 rows of the DataFrame.
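If you prefer not to call con.close() by hand, contextlib.closing from the standard library guarantees the connection is closed even when the query raises an error; here is a minimal sketch using the same database:

import sqlite3
from contextlib import closing

import pandas as pd

# closing() calls con.close() automatically, even if read_sql fails.
with closing(sqlite3.connect("school_database.db")) as con:
    df = pd.read_sql("SELECT * FROM users", con)

print(df.head())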
Reading Table Data
The read_sql function in Pandas can fetch data from a SQL database into a DataFrame using either a SQL query string, as we saw above, or a table name.
When we provide a table name to read_sql, it reads the entire table and converts it into a DataFrame. Note that reading by table name requires a SQLAlchemy connectable (an engine or connection); a raw sqlite3 connection accepts only SQL queries.
Let's see how to do this with an example. Assume we have a table named 'students' in our database:
df = pd.read_sql('students', engine)
engine.dispose()
print(df.head())
Output:
   id     name  age  grade
0   1    Alice   20     10
1   2      Bob   22     11
2   3  Charlie   19     10
3   4    David   21     12
4   5     Emma   20     11
In this example, we pass the table name 'students' to read_sql instead of a query. The function fetches all rows from the 'students' table and converts them into a DataFrame.
Filter rows
Pandas' read_sql allows you to run SQL queries, which means you can add WHERE clauses to filter rows based on specific conditions.
Let’s continue with our ‘users’ table and filter it to include only users who are older than 20:
df = pd.read_sql("SELECT * FROM users WHERE age > 20", con)
con.close()
print(df)
Output:
   id   name  age
0   2    Bob   22
1   4  David   21
In this code, we execute the SQL query "SELECT * FROM users WHERE age > 20". This query selects all columns from the 'users' table, but only the rows where 'age' is greater than 20. The read_sql function then executes this query and loads the result into the DataFrame df.
We can also add multiple conditions.
For example, let’s filter the ‘users’ table to include only users who are older than 20 and whose names start with ‘D’:
df = pd.read_sql("SELECT * FROM users WHERE age > 20 AND name LIKE 'D%'", con)
con.close()
print(df)
Output:
   id   name  age
0   4  David   21
In this code, the SQL query "SELECT * FROM users WHERE age > 20 AND name LIKE 'D%'" selects all columns from the 'users' table, but only the rows where 'age' is greater than 20 and 'name' starts with 'D'. The read_sql function executes this query and loads the result into the DataFrame df.
Extracting Specific Columns
You can use the columns parameter of the read_sql function if you don't want to read an entire table but only specific columns. Note that this parameter only applies when reading a table by name; for a SQL query, name the columns in the query itself.
This is a simple and efficient way to load only the data you need.
Let’s continue with our ‘users’ table and extract only the ‘name’ and ‘age’ columns:
df = pd.read_sql('users', engine, columns=['name', 'age'])
engine.dispose()
print(df.head())
Output:
      name  age
0    Alice   20
1      Bob   22
2  Charlie   19
3    David   21
4     Emma   20
In this code, the read_sql function reads only the 'name' and 'age' columns from the 'users' table. The rest of the code works the same as before.
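If you are working with a SQL query (or a raw sqlite3 connection) rather than a table name, you get the same effect by naming the columns in the query itself:

# Equivalent result using a plain SQL query instead of the columns parameter
df = pd.read_sql("SELECT name, age FROM users", con)
con.close()
print(df.head())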
Setting the DataFrame Index
The index_col parameter of the read_sql function specifies which column should be used as the index of the DataFrame.
Let’s consider an example where we have a ‘students’ table with an ‘id’ column that we want to use as the index:
df = pd.read_sql("SELECT * FROM students", con, index_col='id')
con.close()
print(df.head())
Output:
       name  age  grade
id
1     Alice   20     10
2       Bob   22     11
3   Charlie   19     10
4     David   21     12
5      Emma   20     11
In this code, we use pd.read_sql to execute the SQL query "SELECT * FROM students" and load the result into a Pandas DataFrame. We also set index_col='id' to use the 'id' column from the students table as the index of the DataFrame.
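index_col also accepts a list of columns if you want a MultiIndex. A quick sketch (the column combination here is just for illustration):

# Build a two-level index from 'grade' and 'id'
df = pd.read_sql("SELECT * FROM students", con, index_col=['grade', 'id'])
print(df.head())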
Parameterized Queries
Parameterized queries are a vital feature when it comes to preventing SQL Injection attacks and creating dynamic queries.
The read_sql function in Pandas supports parameterized queries through the params argument.
Let’s consider a scenario where we need to filter the ‘users’ table based on user age, but the specific age will be decided at runtime. Here’s how we can achieve this:
age_to_filter = 20
df = pd.read_sql("SELECT * FROM users WHERE age > ?", con, params=[age_to_filter])
con.close()
print(df)
Output:
   id   name  age
0   2    Bob   22
1   4  David   21
In this code, we execute the SQL query "SELECT * FROM users WHERE age > ?", where "?" is a placeholder that is replaced by the value passed through the params argument. We pass the age_to_filter variable inside a list as the params argument.
The executed SQL query is therefore equivalent to "SELECT * FROM users WHERE age > 20", which selects all columns from the 'users' table where 'age' is greater than 20. The read_sql function executes this query and loads the result into the DataFrame df.
We can also pass the parameters as a tuple, or as a dictionary with named placeholders:
df = pd.read_sql("SELECT * FROM users WHERE age > :age_to_filter", con, params={"age_to_filter": 20})
con.close()
print(df)
Output:
   id   name  age
0   2    Bob   22
1   4  David   21
In this case, the SQL query uses the named placeholder ":age_to_filter", which is replaced by the value provided in the params dictionary.
Parsing Date Columns
By default, Pandas loads date columns as generic objects, so you can use the parse_dates parameter of the read_sql function to parse them into datetime values.
The parse_dates parameter accepts a list of columns to be parsed as dates.
Let’s consider an example where we have a ‘students’ table with a ‘dob’ (date of birth) column:
df = pd.read_sql("SELECT * FROM students", con, parse_dates=['dob'])
con.close()
print(df.head())
print("\nData type of 'dob' column:", df['dob'].dtypes)
Output:
   id     name        dob  grade
0   1    Alice 2003-05-15     10
1   2      Bob 2001-06-20     11
2   3  Charlie 2004-09-10     10
3   4    David 2002-11-25     12
4   5     Emma 2003-08-05     11

Data type of 'dob' column: datetime64[ns]
Once the ‘dob’ column is in the datetime format, you can perform various datetime operations on this column.
For instance, you could extract the year of birth, calculate age, select rows between certain dates, and so forth.
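Here is a quick sketch of such operations on the parsed 'dob' column (the derived column names are just for illustration):

# Extract the year of birth from the datetime column
df['birth_year'] = df['dob'].dt.year

# Approximate age in whole years as of today
df['age_years'] = (pd.Timestamp('today') - df['dob']).dt.days // 365

# Select students born between two dates
mask = (df['dob'] >= '2002-01-01') & (df['dob'] <= '2003-12-31')
print(df[mask])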
Handling Large Datasets with chunks
When dealing with large datasets that don't fit into memory, you can use the chunksize parameter of read_sql. Instead of a single DataFrame, read_sql then returns an iterator that yields one DataFrame per chunk, allowing you to process your data piece by piece.
Suppose we have a large 'users' table and we want to process the data in chunks of 500 rows at a time:
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://USERNAME:PASSWORD@HOST/DB_NAME')

# Enable streaming to load records only as Pandas fetches them
con = engine.connect().execution_options(stream_results=True)

chunks = pd.read_sql("SELECT * FROM users", con, chunksize=500)
for chunk in chunks:
    print(chunk)
Note that we enabled streaming by setting stream_results=True, so records are loaded only as Pandas fetches them instead of loading the whole result set at once.
You can process each chunk as needed within the loop.
Suppose we want to calculate the average age for each chunk. We can modify our loop to do that:
chunks = pd.read_sql("SELECT * FROM users", con, chunksize=500)
for i, chunk in enumerate(chunks, start=1):
    avg_age = chunk['age'].mean()
    print(f"Average age for chunk {i}: {avg_age}")
This will print the average age for each group of 500 rows.
Using chunksize is a powerful tool that lets you work with datasets larger than your machine's memory.
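If you need a single figure for the whole table rather than one per chunk, you can accumulate running totals across chunks. A minimal sketch:

total_age = 0
row_count = 0

chunks = pd.read_sql("SELECT * FROM users", con, chunksize=500)
for chunk in chunks:
    # Accumulate sum and count so the final average covers the whole table
    total_age += chunk['age'].sum()
    row_count += len(chunk)

print(f"Overall average age: {total_age / row_count}")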
Executing Stored Procedures
Stored procedures are a powerful feature of SQL databases. They are sets of SQL statements that perform a specific task and are stored in the database itself.
To call a stored procedure in Python with read_sql, you can write the call as a regular SQL statement. Here's an example:
df = pd.read_sql("CALL get_students_by_age(20)", con)
con.close()
print(df.head())
In this code, we use the SQL CALL statement to execute the stored procedure get_students_by_age, passing 20 as the argument. The result of the stored procedure is loaded into the Pandas DataFrame df.
Note: The specific syntax for calling stored procedures can vary depending on the SQL database you are using.
Calling stored procedures can be a very efficient way to perform complex database operations, as the operations are performed directly on the database server.
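If the procedure argument is decided at runtime, you can combine the call with the params argument instead of formatting the value into the string. A sketch, assuming a MySQL connection whose driver (mysql-connector) uses %s placeholders:

age_to_filter = 20

# %s is the placeholder style of mysql-connector; other drivers use ? or :name
df = pd.read_sql("CALL get_students_by_age(%s)", con, params=[age_to_filter])
con.close()
print(df.head())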