Read SQL Query/Table into DataFrame using Pandas read_sql

The read_sql function allows you to load data from a SQL database directly into a Pandas DataFrame.

It lets you execute a SQL query directly or read an entire table into a DataFrame. By using pandas.read_sql, you create a seamless bridge between your SQL database and Pandas.


Syntax of Pandas read_sql function

The basic syntax of the read_sql function in Pandas is as follows:

pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)

Here’s what each parameter signifies (a combined example follows the list):

  • sql: This parameter can be a string SQL query or a SQLAlchemy Selectable (which represents a compiled SQL statement) to be executed.
  • con: This is your database connection object, which could be a SQLAlchemy engine or a sqlite3 connection.
  • index_col: An optional parameter that specifies a column to set as index.
  • coerce_float: When set to True (the default), it attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point.
  • params: An optional list, tuple, or dictionary of parameters to pass to the SQL query.
  • parse_dates: An optional parameter to parse columns into datetime.
  • columns: If you’re reading a table (not a query), this allows you to select which columns to load.
  • chunksize: If specified, read_sql will return an iterator where chunksize is the number of rows per chunk.
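
To make these parameters concrete, here is a minimal sketch that combines several of them. It assumes a SQLite file named school_db.sqlite containing a ‘users’ table with ‘id’, ‘age’, and ‘signup_date’ columns (the file, table, and column names are placeholders for illustration):

import sqlite3
import pandas as pd

con = sqlite3.connect('school_db.sqlite')

df = pd.read_sql(
    "SELECT * FROM users WHERE age > ?",   # sql: a query string
    con,                                   # con: the database connection
    index_col='id',                        # use the 'id' column as the DataFrame index
    params=[18],                           # value substituted for the ? placeholder
    parse_dates=['signup_date'],           # parse this (assumed) column as datetime
)
con.close()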


Establish a Connection

You can establish a connection to a SQL database in several ways when using the Pandas read_sql function, but the two most common are:

Using sqlite3 Connection

If you’re working with SQLite databases, you can use a sqlite3 connection directly. Here’s an example:

import sqlite3
import pandas as pd
con = sqlite3.connect('school_db.sqlite')

Using SQLAlchemy Engine

You can create a SQLAlchemy engine and pass it as the con argument to the read_sql function. This approach has the advantage of working with many different types of SQL databases.

First, you need to install SQLAlchemy:

pip install sqlalchemy

Here’s an example to connect to a SQLite database:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('sqlite:///school_db.sqlite')

MySQL

To connect to a MySQL database, you need to install the mysql-connector-python package, which is the MySQL connector for Python that SQLAlchemy will use to communicate with the MySQL server:

pip install mysql-connector-python

Once you have the connector installed, you can create an engine for your MySQL database:

from sqlalchemy import create_engine
engine = create_engine('mysql+mysqlconnector://USERNAME:PASSWORD@HOST/DB_NAME')

PostgreSQL

For PostgreSQL, you’ll want to use the psycopg2-binary package, which is the most popular PostgreSQL adapter for Python.

pip install psycopg2-binary

Once the adapter is installed, you can create an engine for your PostgreSQL database:

from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://USERNAME:PASSWORD@HOST/DB_NAME')

Oracle

To interact with an Oracle database using Python, you can use the cx_Oracle library.
First, install the cx_Oracle package with pip:

pip install cx_Oracle

After installing cx_Oracle, you can establish a connection to an Oracle database:

from sqlalchemy import create_engine
engine = create_engine('oracle+cx_oracle://USERNAME:PASSWORD@HOST:PORT/DB_NAME')


Basic SQL Database Queries

First, you need to import the necessary libraries and establish a connection to the database:

import pandas as pd
import sqlite3
con = sqlite3.connect("school_database.db")

Now, let’s assume there is a table called ‘users’ in the database. We can use read_sql to load the entire ‘users’ table into a DataFrame:

df = pd.read_sql("SELECT * FROM users", con)
con.close()
print(df.head())

Output:

   id     name  age
0   1    Alice   20
1   2      Bob   22
2   3  Charlie   19
3   4    David   21
4   5     Emma   20

In the code above, we executed the SQL query “SELECT * FROM users” which selects all columns and rows from the ‘users’ table.
The read_sql function executed this query and loaded the result into a Pandas dataframe, df.
Don’t forget to close the connection after reading the data by calling con.close() to free up resources.
The df.head() function then prints the first 5 rows of the dataframe.
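
As an alternative to calling con.close() yourself, you can wrap the connection in contextlib.closing so it is closed automatically even if the query raises an error. A small sketch of that pattern:

import sqlite3
from contextlib import closing
import pandas as pd

# The connection is closed automatically when the with block exits
with closing(sqlite3.connect("school_database.db")) as con:
    df = pd.read_sql("SELECT * FROM users", con)

print(df.head())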


Reading Table Data

The read_sql function in Pandas fetches data from a SQL database into a DataFrame, using either a SQL query string, as we saw above, or a table name.
When we provide a table name, read_sql reads the entire table and converts it into a DataFrame. Note that reading by table name requires a SQLAlchemy connectable (an engine or connection); a plain DBAPI connection such as sqlite3 only supports query strings.

Let’s see how to do this with an example. Assume we have a table named ‘students’ in our database and the SQLAlchemy engine we created earlier:

df = pd.read_sql('students', engine)
print(df.head())

Output:

   id     name  age  grade
0   1    Alice   20     10
1   2      Bob   22     11
2   3  Charlie   19     10
3   4    David   21     12
4   5     Emma   20     11

In this example, we pass the table name ‘students’ to read_sql instead of a SQL query. The function fetches all rows from the ‘students’ table and converts them into a DataFrame.


Filter rows

Pandas’ read_sql allows you to run SQL queries, which means you can add WHERE clauses to filter rows based on specific conditions.

Let’s continue with our ‘users’ table and filter it to include only users who are older than 20:

df = pd.read_sql("SELECT * FROM users WHERE age > 20", con)
con.close()
print(df)

Output:

   id   name  age
0   2    Bob   22
1   4  David   21

In this code, we execute the SQL query “SELECT * FROM users WHERE age > 20”. This query selects all columns from the ‘users’ table, but only rows where the ‘age’ is greater than 20.

The read_sql function then executes this query and loads the result into the dataframe df.
We can also add multiple conditions.

For example, let’s filter the ‘users’ table to include only users who are older than 20 and whose names start with ‘D’:

df = pd.read_sql("SELECT * FROM users WHERE age > 20 AND name LIKE 'D%'", con)
con.close()
print(df)

Output:

   id   name  age
0   4  David   21

In this code, the SQL query “SELECT * FROM users WHERE age > 20 AND name LIKE 'D%'” selects all columns from the ‘users’ table, but only rows where the ‘age’ is greater than 20 and the ‘name’ starts with ‘D’.

The read_sql function executes this query and loads the result into the dataframe df.


Extracting Specific Columns

You can use the columns parameter in the read_sql function if you don’t want to read an entire table but only specific columns.

This is a simple and efficient way to load only the data you need.
Let’s continue with our ‘users’ table and extract only the ‘name’ and ‘age’ columns:

df = pd.read_sql('users', engine, columns=['name', 'age'])
print(df.head())

Output:

      name  age
0    Alice   20
1      Bob   22
2  Charlie   19
3    David   21
4     Emma   20

In this code, the read_sql function reads only the ‘name’ and ‘age’ columns from the ‘users’ table, which is equivalent to running “SELECT name, age FROM users”. The rest of the code works the same as before.


Setting the DataFrame Index

The index_col parameter of the read_sql function specifies which column should be used as the index of the DataFrame.

Let’s consider an example where we have a ‘students’ table with an ‘id’ column that we want to use as the index:

df = pd.read_sql("SELECT * FROM students", con, index_col='id')
con.close()
print(df.head())

Output:

      name  age  grade
id                     
1    Alice   20     10
2      Bob   22     11
3  Charlie   19     10
4    David   21     12
5     Emma   20     11

In this code, we are using pd.read_sql to execute the SQL query “SELECT * FROM students” and load the result into a pandas dataframe.

We also set index_col='id' to use the ‘id’ column from the students table as the index of the DataFrame.


Parameterized Queries

Parameterized queries are a vital feature when it comes to preventing SQL Injection attacks and creating dynamic queries.

The read_sql function in pandas supports parameterized queries through the params argument.
Let’s consider a scenario where we need to filter the ‘users’ table based on user age, but the specific age will be decided at runtime. Here’s how we can achieve this:

age_to_filter = 20
df = pd.read_sql("SELECT * FROM users WHERE age > ?", con, params=[age_to_filter])
con.close()
print(df)

Output:

   id   name  age
0   2    Bob   22
1   4  David   21

In this code, we execute the SQL query “SELECT * FROM users WHERE age > ?”, where “?” is a placeholder that will be replaced by the value from the params argument.

We pass the age_to_filter variable inside a list as the params argument.

Therefore, the executed query is equivalent to “SELECT * FROM users WHERE age > 20”, which selects all columns from the ‘users’ table where ‘age’ is greater than 20.

The read_sql function executes this query and loads the result into the dataframe df.
We can also pass the parameters as a dictionary, using named placeholders in the query:

df = pd.read_sql("SELECT * FROM users WHERE age > :age_to_filter", con, params={"age_to_filter": 20})
con.close()
print(df)

Output:

   id   name  age
0   2    Bob   22
1   4  David   21

In this case, the SQL query uses a named placeholder “:age_to_filter”, which is replaced by the value provided in the params dictionary.


Parsing Date Columns

You can use the parse_dates parameter of the read_sql function to parse date columns, because Pandas loads date columns as plain objects by default.

The parse_dates parameter accepts a list of columns to be parsed as dates.

Let’s consider an example where we have a ‘students’ table with a ‘dob’ (date of birth) column:

df = pd.read_sql("SELECT * FROM students", con, parse_dates=['dob'])
con.close()
print(df.head())
print("\nData type of 'dob' column:", df['dob'].dtypes)

Output:

   id     name        dob  grade
0   1    Alice 2003-05-15     10
1   2      Bob 2001-06-20     11
2   3  Charlie 2004-09-10     10
3   4    David 2002-11-25     12
4   5     Emma 2003-08-05     11

Data type of 'dob' column: datetime64[ns]

Once the ‘dob’ column is in the datetime format, you can perform various datetime operations on this column.

For instance, you could extract the year of birth, calculate age, select rows between certain dates, and so forth.
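
As a quick sketch of what that looks like, assuming the parsed ‘dob’ column from the example above (the reference date is arbitrary):

import pandas as pd

# Extract the year of birth from the datetime column
df['birth_year'] = df['dob'].dt.year

# Approximate age in whole years as of a fixed reference date
reference = pd.Timestamp('2024-01-01')
df['age'] = (reference - df['dob']).dt.days // 365

# Select students born between two dates
print(df[df['dob'].between('2002-01-01', '2003-12-31')])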


Handling Large Datasets with chunks

When dealing with large datasets that don’t fit into memory, you can use the chunksize parameter in read_sql.

This returns an iterator that yields DataFrames of chunksize rows each, allowing you to process your data in manageable pieces.
If we have a large ‘users’ table and we want to process the data in chunks of 500 rows at a time:

from sqlalchemy import create_engine
engine = create_engine('mysql+mysqlconnector://USERNAME:PASSWORD@HOST/DB_NAME')

# Enable streaming so records are loaded only as Pandas fetches them
con = engine.connect().execution_options(stream_results=True)
chunks = pd.read_sql("SELECT * FROM users", con, chunksize=500)
for chunk in chunks:
    print(chunk)

Note that we enabled streaming by setting stream_results=True, so records are loaded only as Pandas fetches them instead of loading the entire result set into memory at once.

You can process each chunk as needed within the loop.
Suppose we want to calculate the average age for each chunk. We can modify our loop to do that:

chunks = pd.read_sql("SELECT * FROM users", con, chunksize=500)
for i, chunk in enumerate(chunks, start=1):
    avg_age = chunk['age'].mean()
    print(f"Average age for chunk {i}: {avg_age}")

This will print the average age for each group of 500 rows.
Using chunksize is a powerful tool that allows you to work with datasets that are larger than your machine’s memory.
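
You can also maintain running totals across chunks when you need a single result for the whole table. For example, here is a sketch that computes the overall average age without holding the full table in memory:

total_age = 0
total_rows = 0

chunks = pd.read_sql("SELECT * FROM users", con, chunksize=500)
for chunk in chunks:
    # Accumulate a sum and a count instead of keeping the rows
    total_age += chunk['age'].sum()
    total_rows += len(chunk)

print(f"Overall average age: {total_age / total_rows}")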


Executing Stored Procedures

Stored procedures are a powerful feature of SQL databases. They are sets of SQL statements that perform a specific task and are stored in the database itself.

To call a stored procedure in Python with read_sql, you can write the call as a regular SQL statement. Here’s an example:

df = pd.read_sql("CALL get_students_by_age(20)", con)
con.close()
print(df.head())

In this code, we use the SQL CALL statement to execute the stored procedure get_students_by_age, passing 20 as the argument.

The result of the stored procedure is loaded into the Pandas DataFrame df.
Note: The specific syntax for calling stored procedures can vary depending on the SQL database you are using.

Calling stored procedures can be a very efficient way to perform complex database operations, as the operations are performed directly on the database server.
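
For example, in PostgreSQL a set-returning function is typically invoked with SELECT rather than CALL. A hedged sketch, assuming a function named get_students_by_age exists on the server:

# PostgreSQL style: query a set-returning function with SELECT
# (get_students_by_age is a hypothetical server-side function)
df = pd.read_sql("SELECT * FROM get_students_by_age(20)", engine)
print(df.head())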

