Export Pandas DataFrame to Google BigQuery using to_gbq

The to_gbq function allows you to upload data from a Pandas DataFrame into a BigQuery table.

In this tutorial, you’ll learn how to export data from a Pandas DataFrame to BigQuery using the to_gbq function.


Installing Required Libraries

First, you need to ensure you have the pandas-gbq library installed.

This library provides essential functions like to_gbq that we will use.

!pip install pandas-gbq

Output:

Collecting pandas-gbq
...
Successfully installed pandas-gbq-x.x.x
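To confirm the installation, you can import the library and check its version (a quick sanity check; the exact version on your machine will differ):

import pandas_gbq

# Print the installed pandas-gbq version
print(pandas_gbq.__version__)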


Setting up Google Cloud SDK

If you haven’t installed the Google Cloud SDK yet, you can install it with:

!curl https://sdk.cloud.google.com | bash

Restart your shell or terminal so that gcloud is added to your PATH.

Once the SDK is installed, you can authenticate using the following:

!gcloud auth login

You’ll be prompted to log in using your Google credentials. After successfully logging in, your credentials will be stored locally and used for future requests.

For programmatic access via pandas-gbq, you need to set up application default credentials. Run:

!gcloud auth application-default login

You’ll be guided through the authentication process. After this step, your Python scripts using pandas-gbq can interact with BigQuery.
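If you prefer not to rely on application default credentials, for example in a scheduled job on a server, you can pass explicit credentials through the credentials parameter. The sketch below assumes you have downloaded a service-account key file; the file path, dataset, table, and project ID are placeholders:

from google.oauth2 import service_account
import pandas as pd

# Load explicit credentials from a service-account key file (path is a placeholder)
credentials = service_account.Credentials.from_service_account_file('/path/to/service-account-key.json')

df = pd.DataFrame({'id': [1, 2, 3]})

# Pass the credentials object directly instead of using application default credentials
df.to_gbq(destination_table='dataset_name.table_name',
          project_id='your_project_id',
          credentials=credentials,
          if_exists='replace')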


to_gbq Syntax and Parameters

The to_gbq function syntax is as follows:

DataFrame.to_gbq(destination_table, project_id=None,
 chunksize=None, reauth=False, if_exists='fail',
 auth_local_webserver=False, table_schema=None,
 location=None, progress_bar=True, credentials=None)

Parameters:

  1. destination_table (str): The name of the table to which you want to write the data in ‘dataset.tablename’ format.
  2. project_id (str, optional): Your Google Cloud Project ID. If not provided, it will be inferred from the environment.
  3. chunksize (int, optional): Number of rows to be inserted in each chunk from the DataFrame. By default, it inserts all rows at once.
  4. reauth (bool, default False): Force Google user to re-authenticate, which can be useful if multiple accounts are used.
  5. if_exists (str, default ‘fail’): Behavior when the table already exists. Options include ‘fail’, ‘replace’, and ‘append’.
  6. auth_local_webserver (bool, default False): Use the local webserver flow for authentication.
  7. table_schema (list of dicts, optional): Schema to use for the destination table, where each dict describes one column. If omitted, the schema is inferred from the DataFrame dtypes.
  8. location (str, optional): The geographic location where the load job should run. It must match the location of the target dataset.
  9. progress_bar (bool, default True): Show a progress bar for uploads.
  10. credentials (google.auth.credentials.Credentials, optional): The credentials to be used for authentication with GCP. By default, it uses application default credentials.
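To see several of these parameters working together, here is a minimal sketch; the dataset, table, and project ID are placeholders:

import pandas as pd

df = pd.DataFrame({'Product': ['A', 'B'], 'Sales': [100, 150]})

# Upload in chunks of 1,000 rows, replace any existing table, and show a progress bar
df.to_gbq(destination_table='sales_data.monthly_sales',
          project_id='your_project_id',
          if_exists='replace',
          chunksize=1000,
          progress_bar=True)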


Specifying Dataset and Table in destination_table

The destination_table string should follow the ‘dataset.tablename’ format. Let’s take an example:

If you want to upload data to the ‘monthly_sales’ table inside the ‘sales_data’ dataset, your destination_table string will be:

destination_table = 'sales_data.monthly_sales'

When using to_gbq, it will look something like this:

df.to_gbq(destination_table='sales_data.monthly_sales', project_id='your_project_id')

Note: If the specified table doesn’t exist, to_gbq will create it for you; the if_exists parameter only controls what happens when the table already exists.


Using the if_exists Parameter

The if_exists parameter in the to_gbq method controls the behavior when you try to upload a DataFrame to a BigQuery table that already exists.

Default Behavior if_exists='fail'

This is the default behavior. If the table already exists in BigQuery, the upload operation will fail, and no changes will be made.

Example:

df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', if_exists='fail')

Output:

If the table already exists:

TableCreationError: Table dataset_name:table_name already exists.
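If you want a script to handle this case instead of stopping with a traceback, you can catch the exception. A minimal sketch, assuming your pandas-gbq version exposes TableCreationError under pandas_gbq.gbq (the import path can vary between releases):

import pandas as pd
from pandas_gbq.gbq import TableCreationError  # import path may differ by version

df = pd.DataFrame({'id': [1, 2, 3]})
try:
    df.to_gbq(destination_table='dataset_name.table_name',
              project_id='your_project_id', if_exists='fail')
except TableCreationError:
    # The table already exists; decide here whether to append, replace, or skip
    print('Table already exists, skipping upload.')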

Replace Existing Data if_exists='replace'

If the table already exists in BigQuery, it will be overwritten with the new data. Essentially, the existing table will be dropped, and a new one will be created.

Example:

df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', if_exists='replace')

Output:

Whether or not the table exists:

1 out of 1 chunks uploaded.

Appending Data to an Existing Table if_exists='append'

If the table already exists in BigQuery, the data from the DataFrame will be appended to the table.

If the table doesn’t exist, it will be created.

Example:

df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', if_exists='append')

Output:

Whether or not the table exists:

1 out of 1 chunks uploaded.
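To confirm how many rows the table holds after an append, you can read it back with pandas_gbq.read_gbq. A small sketch, using the same placeholder dataset, table, and project ID:

import pandas_gbq

# Count the rows in the destination table after the append
result = pandas_gbq.read_gbq(
    'SELECT COUNT(*) AS row_count FROM `dataset_name.table_name`',
    project_id='your_project_id')
print(result['row_count'].iloc[0])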


Manual Table Schema Definition

The schema is defined as a list of dictionaries, where each dictionary describes one column: its name, its data type, and optional properties.

Each dictionary generally has:

  • name: Name of the column.
  • type: Data type of the column in BigQuery format.
  • mode (optional): Defines if the column can have NULL values or if it’s a repeated field. Default is ‘NULLABLE’.

Assume you have the following DataFrame:

import pandas as pd
data = {
    'Product': ['A', 'B', 'C'],
    'Sales': [100, 150, 200],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03']
}
df = pd.DataFrame(data)

Define the Schema

For our DataFrame, the manual schema definition looks like:

table_schema = [
    {'name': 'Product', 'type': 'STRING'},
    {'name': 'Sales', 'type': 'INT64'},
    {'name': 'Date', 'type': 'DATE'}
]

Use to_gbq with Specified Schema

You can use the table_schema parameter to specify the schema:

df.to_gbq(destination_table='sales_data.product_sales', project_id='your_project_id', table_schema=table_schema, if_exists='replace')

Output:

1 out of 1 chunks uploaded.
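Note that the Date column in this DataFrame holds strings. Depending on your pandas-gbq version and load method, a string column may be rejected for a DATE field, so it can be safer to convert it before uploading. Continuing from the DataFrame above:

# Convert the string column to Python date objects so it matches the DATE type
df['Date'] = pd.to_datetime(df['Date']).dt.date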


Handling Nested and Repeated Columns

You can use the table_schema parameter to specify the schema for nested or repeated columns.

Example DataFrame with Nested Data:

import pandas as pd
data = {
    'Product': ['A', 'B'],
    'Details': [{'Color': 'Red', 'Size': 'Large'}, {'Color': 'Blue', 'Size': 'Medium'}]
}
df = pd.DataFrame(data)

Schema Definition:

For nested columns, you need a ‘fields’ key in your schema dictionary:

table_schema = [
    {'name': 'Product', 'type': 'STRING'},
    {'name': 'Details', 'type': 'RECORD', 'fields': [
        {'name': 'Color', 'type': 'STRING'},
        {'name': 'Size', 'type': 'STRING'}
    ]}
]

DataFrame with Repeated Data:

Repeated columns allow you to have arrays of data for a particular field. It’s useful for one-to-many relationships.

data = {
    'Product': ['A', 'B'],
    'Tags': [['Outdoor', 'Summer'], ['Indoor', 'Winter']]
}
df = pd.DataFrame(data)

Schema Definition

For repeated columns, use ‘mode’: ‘REPEATED’:

table_schema = [
    {'name': 'Product', 'type': 'STRING'},
    {'name': 'Tags', 'type': 'STRING', 'mode': 'REPEATED'}
]

Now you use to_gbq like this:

df.to_gbq(destination_table='sales_data.product_info', project_id='your_project_id', table_schema=table_schema, if_exists='replace')
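Once uploaded, repeated fields can be flattened at query time with BigQuery’s UNNEST. A small sketch using read_gbq, with the same placeholder project and table:

import pandas_gbq

# Flatten the repeated Tags column: one row per (Product, tag) pair
query = """
SELECT Product, tag
FROM `sales_data.product_info`, UNNEST(Tags) AS tag
"""
flat = pandas_gbq.read_gbq(query, project_id='your_project_id')
print(flat.head())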


Chunking Large DataFrames

The chunksize parameter in the to_gbq function specifies the number of rows per chunk. Here’s how you can use it:

import pandas as pd
import pandas_gbq

# Sample large DataFrame
data = {'column1': range(1, 100001), 'column2': range(100001, 1, -1)}
large_df = pd.DataFrame(data)

# Using chunksize with to_gbq
pandas_gbq.to_gbq(large_df, 'your_dataset.your_table', project_id='your_project_id', if_exists='replace', chunksize=5000)

In this example, the large DataFrame large_df with 100,000 rows is broken into chunks of 5,000 rows each, resulting in 20 separate uploads to BigQuery.


Track Progress of Upload Using Progress Bar

The progress_bar parameter in the to_gbq function allows you to track the progress of your upload.

df.to_gbq(destination_table='dataset_name.table_name', project_id='your_project_id', progress_bar=True)

Output:

A dynamic progress bar in your console or Jupyter notebook that updates as chunks of data get uploaded:

Uploading: 100%|█████████████████████████| 5/5 [00:05<00:00,  1.00s/rows]


Logging

You can set up logging to get detailed information about the upload process, particularly useful for debugging.

First, configure Python’s logging module:

import logging
logging.basicConfig(level=logging.INFO)

With this setup, pandas_gbq will provide detailed logs about the upload process.

Output:

INFO:pandas_gbq.gbq:Uploading 1000000 rows to table dataset_name.table_name
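If the root-level configuration is too noisy, you can raise the verbosity for pandas-gbq alone by targeting its logger by name (the 'pandas_gbq' logger shown in the output above):

import logging

# Keep other libraries quiet but show debug-level detail from pandas-gbq
logging.basicConfig(level=logging.WARNING)
logging.getLogger('pandas_gbq').setLevel(logging.DEBUG)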


Real-world Example

Imagine you run an e-commerce website with a PostgreSQL database holding transaction records.

You want to sync this transactional data to BigQuery every day at midnight to run complex analytics, generate reports, and perhaps use it with other GCP tools like Data Studio.

Extracting Data from PostgreSQL

First, you’ll need to extract the relevant data from your local PostgreSQL instance:

import psycopg2
import pandas as pd
# Connect to the local PostgreSQL instance
conn = psycopg2.connect(
    host="localhost",
    database="your_database",
    user="your_user",
    password="your_password"
)
# Pull the last day of transactions into a DataFrame
query = "SELECT * FROM transactions WHERE transaction_date >= current_date - interval '1 day';"
df = pd.read_sql_query(query, conn)
conn.close()
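Recent pandas versions warn when read_sql_query is given a raw DBAPI connection, so you may prefer an SQLAlchemy engine instead. A sketch with the same placeholder connection details:

import pandas as pd
from sqlalchemy import create_engine

# Build an engine for the same local PostgreSQL instance (credentials are placeholders)
engine = create_engine('postgresql+psycopg2://your_user:your_password@localhost/your_database')
query = "SELECT * FROM transactions WHERE transaction_date >= current_date - interval '1 day';"
df = pd.read_sql_query(query, engine)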

Loading Data to BigQuery

Now that you have the data in a pandas DataFrame, use to_gbq:

import pandas_gbq
destination = 'your_dataset.transactions'
df.to_gbq(destination_table=destination, project_id='your_project_id', if_exists='append')

Automation

To sync daily, use a task scheduler:

    • On Linux, use cron.
    • On Windows, use Task Scheduler.
    • If you’re leveraging GCP, consider Cloud Functions or Cloud Scheduler.

For this job, a Linux cron entry might look like:

0 0 * * * /usr/bin/python3 /path_to_script/your_script.py

This will run the script every day at midnight.
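For reference, here is a hypothetical skeleton for your_script.py that combines the extraction and load steps shown above; all connection details, dataset, and project names are placeholders:

# your_script.py - hypothetical daily sync skeleton
import pandas as pd
import psycopg2

def extract_transactions():
    # Pull the last day of transactions from PostgreSQL
    conn = psycopg2.connect(host='localhost', database='your_database',
                            user='your_user', password='your_password')
    query = ("SELECT * FROM transactions "
             "WHERE transaction_date >= current_date - interval '1 day';")
    df = pd.read_sql_query(query, conn)
    conn.close()
    return df

def load_to_bigquery(df):
    # Append the day's rows to the BigQuery table
    df.to_gbq(destination_table='your_dataset.transactions',
              project_id='your_project_id', if_exists='append')

if __name__ == '__main__':
    transactions = extract_transactions()
    if not transactions.empty:
        load_to_bigquery(transactions)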


Resource

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_gbq.html
