Export Pandas DataFrame to Google BigQuery using to_gbq
The to_gbq function allows you to upload data from a Pandas DataFrame into a BigQuery table.
In this tutorial, you’ll learn how to export data from a Pandas DataFrame to BigQuery using the to_gbq function.
- 1 Installing Required Libraries
- 2 Setting up Google Cloud SDK
- 3 to_gbq Syntax and Parameters
- 4 Specifying Dataset and Table in destination_table
- 5 Using the if_exists Parameter
- 6 Manual Table Schema Definition
- 7 Handling Nested and Repeated Columns
- 8 Chunking Large DataFrames
- 9 Track Progress of Upload Using Progress Bar
- 10 Logging
- 11 Real-world Example
- 12 Resource
Installing Required Libraries
First, you need to ensure you have the pandas-gbq library installed. This library provides essential functions like to_gbq that we will use.
!pip install pandas-gbq
Output:
Collecting pandas-gbq ... Successfully installed pandas-gbq-x.x.x
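You can quickly verify the installation by importing the library and printing its version:
import pandas_gbq

# Confirm the library imports and check the installed version
print(pandas_gbq.__version__)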
Setting up Google Cloud SDK
If you haven’t installed the Google Cloud SDK, follow the instructions:
!curl https://sdk.cloud.google.com | bash
Restart your shell or terminal to add gcloud to your path.
Once the SDK is installed, you can authenticate using the following:
!gcloud auth login
You’ll be prompted to log in using your Google credentials. After successfully logging in, your credentials will be stored locally and used for future requests.
For programmatic access via pandas-gbq, you need to set up application default credentials. Run:
!gcloud auth application-default login
You’ll be guided through the authentication process. After this step, your Python scripts using pandas-gbq can interact with BigQuery.
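If you are running on a server where the browser-based login flow isn’t practical, you can instead pass explicit service account credentials to to_gbq. Here is a minimal sketch; the key file path, dataset, table, and project ID are placeholders:
import pandas as pd
from google.oauth2 import service_account

# Load credentials from a service account key file (placeholder path)
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service_account_key.json"
)

df = pd.DataFrame({"col1": [1, 2, 3]})
df.to_gbq(
    destination_table="your_dataset.your_table",
    project_id="your_project_id",
    credentials=credentials,  # used instead of application default credentials
)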
to_gbq Syntax and Parameters
The to_gbq function syntax is as follows:
DataFrame.to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=False, table_schema=None, location=None, progress_bar=True, credentials=None)
Parameters:
- destination_table (str): The name of the table to which you want to write the data in ‘dataset.tablename’ format.
- project_id (str, optional): Your Google Cloud Project ID. If not provided, it will be inferred from the environment.
- chunksize (int, optional): Number of rows to be inserted in each chunk from the DataFrame. By default, all rows are inserted at once.
- reauth (bool, default False): Force Google user to re-authenticate, which can be useful if multiple accounts are used.
- if_exists (str, default ‘fail’): Behavior when the table already exists. Options include ‘fail’, ‘replace’, and ‘append’.
- auth_local_webserver (bool, default False): Use the local webserver flow for authentication.
- table_schema (list of dicts, optional): Define the schema for the BigQuery table if needed. Each dict describes one column (see the schema section below).
- location (str, optional): The geographic location where the load job runs. It must match the location of the target dataset.
- progress_bar (bool, default True): Show a progress bar for uploads.
- credentials (google.auth.credentials.Credentials, optional): The credentials to be used for authentication with GCP. By default, it uses application default credentials.
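To see several of the parameters above in action, here is a minimal sketch that uploads a small DataFrame; the dataset, table, and project ID are placeholders:
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})

df.to_gbq(
    destination_table="your_dataset.your_table",  # 'dataset.tablename' format
    project_id="your_project_id",
    if_exists="append",   # append if the table already exists
    chunksize=10000,      # upload in chunks of 10,000 rows
    progress_bar=True,    # show a progress bar during the upload
)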
Specifying Dataset and Table in destination_table
The destination_table string should follow the ‘dataset.tablename’ format. For example, if you want to upload data to the ‘monthly_sales’ table inside the ‘sales_data’ dataset, your destination_table string will be:
destination_table = 'sales_data.monthly_sales'
When using to_gbq, it will look something like this:
df.to_gbq(destination_table='sales_data.monthly_sales', project_id='your_project_id')
Note: If the specified table doesn’t exist, to_gbq will create it for you (the if_exists parameter only controls what happens when the table already exists).
Using the if_exists Parameter
The if_exists parameter in the to_gbq method controls the behavior when you try to upload a DataFrame to a BigQuery table that already exists.
Default Behavior if_exists='fail'
This is the default behavior. If the table already exists in BigQuery, the upload operation will fail, and no changes will be made.
Example:
df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', if_exists='fail')
Output:
If the table already exists:
TableCreationError: Table dataset_name:table_name already exists.
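If you want your script to handle this case gracefully instead of stopping, you can catch the error. A minimal sketch (the exact import path of TableCreationError may differ between pandas-gbq versions):
import pandas as pd
import pandas_gbq

df = pd.DataFrame({"column1": [1, 2, 3]})

try:
    df.to_gbq(
        destination_table="dataset_name.table_name",
        project_id="your_project_id",
        if_exists="fail",
    )
except pandas_gbq.gbq.TableCreationError:
    # The table already exists; decide whether to append, replace, or skip
    print("Table already exists, skipping upload.")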
Replace Existing Data if_exists='replace'
If the table already exists in BigQuery, it will be overwritten with the new data. Essentially, the existing table will be dropped, and a new one will be created.
Example:
df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', if_exists='replace')
Output:
Whether or not the table exists:
1 out of 1 chunks uploaded.
Appending Data to an Existing Table if_exists='append'
If the table already exists in BigQuery, the data from the DataFrame will be appended to the table.
If the table doesn’t exist, it will be created.
Example:
df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', if_exists='append')
Output:
Whether or not the table exists:
1 out of 1 chunks uploaded.
Manual Table Schema Definition
The schema is defined as a list of dictionaries, where each dictionary represents a column, its data type, and optional properties.
Each dictionary generally has:
- name: Name of the column.
- type: Data type of the column in BigQuery format.
- mode (optional): Defines if the column can have NULL values or if it’s a repeated field. Default is ‘NULLABLE’.
Assume you have the following DataFrame:
import pandas as pd

data = {
    'Product': ['A', 'B', 'C'],
    'Sales': [100, 150, 200],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03']
}
df = pd.DataFrame(data)
Define the Schema
For our DataFrame, the manual schema definition looks like:
table_schema = [
    {'name': 'Product', 'type': 'STRING'},
    {'name': 'Sales', 'type': 'INT64'},
    {'name': 'Date', 'type': 'DATE'}
]
Use to_gbq with Specified Schema
You can use the table_schema parameter to specify the schema:
df.to_gbq(destination_table='sales_data.product_sales', project_id='your_project_id', table_schema=table_schema, if_exists='replace')
Output:
1 out of 1 chunks uploaded.
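If you want to confirm that BigQuery applied the schema you specified, you can read the table back with read_gbq and inspect the dtypes; a quick check using the same placeholder project:
import pandas_gbq

# Read the freshly created table back and inspect the column types
result = pandas_gbq.read_gbq(
    "SELECT * FROM sales_data.product_sales",
    project_id="your_project_id",
)
print(result.dtypes)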
Handling Nested and Repeated Columns
You can use the table_schema parameter to specify the schema for nested or repeated columns.
Example DataFrame with Nested Data:
import pandas as pd

data = {
    'Product': ['A', 'B'],
    'Details': [
        {'Color': 'Red', 'Size': 'Large'},
        {'Color': 'Blue', 'Size': 'Medium'}
    ]
}
df = pd.DataFrame(data)
Schema Definition:
For nested columns, you need a ‘fields’ key in your schema dictionary:
table_schema = [
    {'name': 'Product', 'type': 'STRING'},
    {'name': 'Details', 'type': 'RECORD', 'fields': [
        {'name': 'Color', 'type': 'STRING'},
        {'name': 'Size', 'type': 'STRING'}
    ]}
]
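You can then upload the nested DataFrame with this schema just as before (the table name here is only an example):
df.to_gbq(
    destination_table='sales_data.product_details',  # example table name
    project_id='your_project_id',
    table_schema=table_schema,
    if_exists='replace'
)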
DataFrame with Repeated Data:
Repeated columns allow you to have arrays of data for a particular field. It’s useful for one-to-many relationships.
data = {
    'Product': ['A', 'B'],
    'Tags': [['Outdoor', 'Summer'], ['Indoor', 'Winter']]
}
df = pd.DataFrame(data)
Schema Definition
For repeated columns, use ‘mode’: ‘REPEATED’:
table_schema = [
    {'name': 'Product', 'type': 'STRING'},
    {'name': 'Tags', 'type': 'STRING', 'mode': 'REPEATED'}
]
Now you can use to_gbq like this:
df.to_gbq(destination_table='sales_data.product_info', project_id='your_project_id', table_schema=table_schema, if_exists='replace')
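To spot-check the repeated column after the upload, you can query it back and flatten the array with UNNEST; a small sketch using read_gbq:
import pandas_gbq

# Flatten the repeated Tags column into one row per (Product, tag) pair
query = """
SELECT Product, tag
FROM `sales_data.product_info`, UNNEST(Tags) AS tag
"""
tags_df = pandas_gbq.read_gbq(query, project_id="your_project_id", dialect="standard")
print(tags_df)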
Chunking Large DataFrames
The chunksize parameter in the to_gbq function specifies the number of rows per chunk. Here’s how you can use it:
import pandas as pd
import pandas_gbq

# Sample large DataFrame
data = {'column1': range(1, 100001), 'column2': range(100001, 1, -1)}
large_df = pd.DataFrame(data)

# Using chunksize with to_gbq
pandas_gbq.to_gbq(
    large_df,
    'your_dataset.your_table',
    project_id='your_project_id',
    if_exists='replace',
    chunksize=5000
)
In this example, the large DataFrame large_df with 100,000 rows is broken into chunks of 5,000 rows each, resulting in 20 separate uploads to BigQuery.
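If you need more control over each batch, for example to retry or log individual failures, you can also slice the DataFrame yourself and append each piece. This is only an alternative sketch, not something chunksize requires:
chunk_size = 5000

# Upload the DataFrame in explicit 5,000-row slices
for start in range(0, len(large_df), chunk_size):
    chunk = large_df.iloc[start:start + chunk_size]
    chunk.to_gbq(
        destination_table='your_dataset.your_table',
        project_id='your_project_id',
        if_exists='append',  # each slice is appended to the same table
    )
    print(f"Uploaded rows {start} to {start + len(chunk) - 1}")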
Track Progress of Upload Using Progress Bar
The progress_bar parameter in the to_gbq function allows you to track the progress of your upload.
df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', progress_bar=True)
Output:
A dynamic progress bar in your console or Jupyter notebook that updates as chunks of data get uploaded:
Uploading: 100%|█████████████████████████| 5/5 [00:05<00:00, 1.00s/rows]
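For non-interactive runs, such as the cron job later in this tutorial, you can disable the bar:
df.to_gbq(
    destination_table='dataset_name.table_name',
    project_id='your_project_id',
    progress_bar=False  # no progress bar in automated or scripted runs
)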
Logging
You can set up logging to get detailed information about the upload process, particularly useful for debugging.
First, configure Python’s logging module:
import logging

logging.basicConfig(level=logging.INFO)
With this setup, pandas_gbq will provide detailed logs about the upload process.
Output:
INFO:pandas_gbq.gbq:Uploading 1000000 rows to table dataset_name.table_name
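To keep these messages for later inspection, you can attach a file handler to the pandas_gbq logger; a minimal sketch with a placeholder log file name:
import logging

logger = logging.getLogger("pandas_gbq")
logger.setLevel(logging.INFO)

# Also write the upload logs to a file (placeholder path)
file_handler = logging.FileHandler("bigquery_uploads.log")
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(file_handler)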
Real-world Example
Imagine you run an e-commerce website with a PostgreSQL database holding transaction records.
You want to sync this transactional data to BigQuery every day at midnight to run complex analytics, generate reports, and perhaps use it with other GCP tools like Data Studio.
Extracting Data from PostgreSQL
First, you’ll need to extract the relevant data from your local PostgreSQL instance:
import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="localhost",
    database="your_database",
    user="your_user",
    password="your_password"
)

query = "SELECT * FROM transactions WHERE transaction_date >= current_date - interval '1 day';"
df = pd.read_sql_query(query, conn)
conn.close()
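Note that recent pandas versions emit a warning when read_sql_query receives a raw psycopg2 connection; if you see it, a SQLAlchemy engine is a drop-in replacement (connection details below are placeholders):
import pandas as pd
from sqlalchemy import create_engine

# Same query as above, but executed through a SQLAlchemy engine
engine = create_engine(
    "postgresql+psycopg2://your_user:your_password@localhost/your_database"
)
query = "SELECT * FROM transactions WHERE transaction_date >= current_date - interval '1 day';"
df = pd.read_sql_query(query, engine)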
Loading Data to BigQuery
Now that you have the data in a pandas DataFrame, use to_gbq:
import pandas_gbq

destination = 'your_dataset.transactions'
df.to_gbq(destination_table=destination, project_id='your_project_id', if_exists='append')
Automation
To sync daily:
Use a task scheduler:
- On Linux, use cron.
- On Windows, use Task Scheduler.
- If you’re leveraging GCP, consider Cloud Functions or Cloud Scheduler.
For this job, here’s a Linux cron example:
0 0 * * * /usr/bin/python3 /path_to_script/your_script.py
This will run the script every day at midnight.
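Putting the extract and load steps together, /path_to_script/your_script.py could look roughly like this (a sketch assembled from the snippets above; connection details and IDs are placeholders):
#!/usr/bin/env python3
"""Daily sync of yesterday's transactions from PostgreSQL to BigQuery."""
import pandas as pd
import psycopg2

# Extract yesterday's transactions from PostgreSQL
conn = psycopg2.connect(
    host="localhost",
    database="your_database",
    user="your_user",
    password="your_password",
)
query = (
    "SELECT * FROM transactions "
    "WHERE transaction_date >= current_date - interval '1 day';"
)
df = pd.read_sql_query(query, conn)
conn.close()

# Load into BigQuery, appending to the existing table
if not df.empty:
    df.to_gbq(
        destination_table="your_dataset.transactions",
        project_id="your_project_id",
        if_exists="append",
        progress_bar=False,  # no console to watch when run from cron
    )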
Resource
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html