Convert Python Pandas DataFrame to JSON using to_json

The to_json function is particularly useful for converting Pandas DataFrames to JSON files. It serialises a DataFrame to a JSON string, or writes it directly to a file or file-like object.

The function has several parameters that offer customization of the JSON (JavaScript Object Notation) output. One of these is the orient parameter, which controls the format of the resulting JSON.

 

Syntax and Parameters of to_json

Here is the syntax for the to_json function:

DataFrame.to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=True)

Let’s look at each of these parameters in detail:

  1. path_or_buf – A string path, path object, or file-like object. If not specified, the result is returned as a string.
  2. orient – The format of the resulting JSON. Options are ‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, and ‘table’. A ValueError is raised if an unsupported value is passed.
  3. date_format – Controls the format of datetime objects: ‘epoch’ (epoch milliseconds) or ‘iso’ (ISO 8601). The default depends on the orient: ‘iso’ for orient=‘table’ and ‘epoch’ for all other orients.
  4. double_precision – The number of decimal places to use when encoding floating point values.
  5. force_ascii – Force encoded string to be ASCII.
  6. date_unit – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’ or ‘ns’.
  7. default_handler – Handler to call if an object cannot otherwise be converted to a suitable format for JSON. It should take a single argument (the object to convert) and return a serialisable object.
  8. lines – If set to True, the JSON output will be in line delimited JSON format.
  9. compression – The type of compression to use. With the default ‘infer’, the compression is inferred from the filename extension.
  10. index – Whether to include the DataFrame’s indices in the JSON output.
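As a quick sketch of default_handler, consider a cell holding a Python set, which the JSON serialiser cannot handle on its own (the column name here is just for illustration):

```python
import pandas as pd

# A cell holding a set is not natively JSON-serialisable.
df = pd.DataFrame({'vals': [{1}]})

# default_handler is called for each object the serialiser cannot
# convert on its own; here we turn the set into a list.
json_str = df.to_json(default_handler=list)
print(json_str)
```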

In the following sections, we’ll look at some examples of using to_json with different settings for these parameters.
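For example, leaving path_or_buf unset makes to_json return the JSON as a plain string instead of writing a file:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})

# No path given, so the result comes back as a string.
json_str = df.to_json()
print(json_str)  # {"A":{"0":1,"1":2}}
```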

 

Convert Pandas DataFrame to JSON file

Now, let’s look at an example of using to_json to convert a DataFrame to a JSON file.
First, we’ll need to create a DataFrame. For this tutorial, let’s use some sample data.

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'City': ['New York', 'London', 'Bangkok']}
df = pd.DataFrame(data)
print(df)

The output of this code will be:

    Name  Age      City
0   John   28  New York
1   Anna   24    London
2  Peter   22   Bangkok

We have a DataFrame df with three columns – Name, Age, and City. Now let’s convert this DataFrame to a JSON file using the to_json function.

df.to_json('data.json')

This will write the DataFrame to a JSON file called ‘data.json’. The resulting JSON will look like this:

{
    "Name": {
        "0": "John",
        "1": "Anna",
        "2": "Peter"
    },
    "Age": {
        "0": 28,
        "1": 24,
        "2": 22
    },
    "City": {
        "0": "New York",
        "1": "London",
        "2": "Bangkok"
    }
}

In this output, each column name from the DataFrame becomes a key in the JSON file. The values under each key form another dictionary where the keys are the indices from the DataFrame and the values are the corresponding cell values.

 

Export to Compressed JSON

The compression parameter allows you to export your DataFrame to a compressed JSON file directly.
Here’s how you can do it:

df.to_json('compressed_data.json.gz', compression='gzip')

The above line of code writes the DataFrame to a gzipped JSON file called ‘compressed_data.json.gz’. Note that when the filename ends with ‘.gz’, Pandas infers that the data should be compressed using gzip, even if the compression argument isn’t explicitly set to ‘gzip’.
The compression parameter supports several types of compression including ‘gzip’, ‘bz2’, ‘zip’, ‘xz’. This parameter is useful when you’re working with large datasets that can be more efficiently stored in a compressed format.
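Reading the file back is symmetric: with the default compression='infer', read_json also detects gzip from the ‘.gz’ suffix (the filename below matches the example above):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 22],
                   'City': ['New York', 'London', 'Bangkok']})
df.to_json('compressed_data.json.gz', compression='gzip')

# compression defaults to 'infer', so the .gz suffix is enough.
restored = pd.read_json('compressed_data.json.gz')
print(restored.equals(df))  # True
```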

 

Exploring Different Orientation Options

The orient parameter in the to_json function allows you to customize the format of the resulting JSON. Let’s explore the different options that it offers.
The possible options for orient are ‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, and ‘table’. The DataFrame index is included in the output for the ‘split’, ‘index’, ‘columns’, and ‘table’ options, but omitted for ‘records’ and ‘values’.

Remember to use the same orientation when reading the JSON file using read_json.
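For instance, a DataFrame written with orient='records' should be read back with the same orient (StringIO is used here because newer Pandas versions expect file-like input for literal JSON strings):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
json_str = df.to_json(orient='records')

# Use the same orient when reading the JSON back.
restored = pd.read_json(StringIO(json_str), orient='records')
print(restored.equals(df))  # True
```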

 

split Orientation

The ‘split’ orientation is a special one. In this format, the data is split into different sections for the column headers, row indices, and data. This can be particularly useful if you need to reconstruct the DataFrame later as it clearly separates the key components of the DataFrame.
Let’s take a closer look at how the ‘split’ orientation works.

print(df.to_json(orient='split'))

Output:

{"columns":["Name","Age","City"],"index":[0,1,2],"data":[["John",28,"New York"],["Anna",24,"London"],["Peter",22,"Bangkok"]]}

In the ‘split’ format, the JSON output consists of three main parts:

  1. “columns”: This is a list of the DataFrame’s column names.
  2. “index”: This is a list of the DataFrame’s row indices.
  3. “data”: This is a list of lists, where each sub-list represents a row of data from the DataFrame.

This structure ensures that all the necessary information to reconstruct the original DataFrame is preserved.

The ‘split’ orientation is often used when storing DataFrame data for later usage due to this feature.
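A minimal sketch of that round trip, reading the ‘split’ JSON back into an identical DataFrame:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 22]})
split_json = df.to_json(orient='split')

# Columns, index, and data are stored separately, so nothing is lost.
restored = pd.read_json(StringIO(split_json), orient='split')
print(restored.equals(df))  # True
```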

 

records Orientation

In this case, each row of the DataFrame is output as a separate JSON object within a list.
Here’s how you can use the ‘records’ orientation:

print(df.to_json(orient='records'))

Output:

[{"Name":"John","Age":28,"City":"New York"},{"Name":"Anna","Age":24,"City":"London"},{"Name":"Peter","Age":22,"City":"Bangkok"}]

In the ‘records’ format, the output JSON is a list of objects. Each object represents a row from the DataFrame, where each key-value pair corresponds to a column and its value in that row.
One advantage of the ‘records’ format is its readability.
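The ‘records’ orientation also pairs with the lines parameter mentioned earlier: setting lines=True emits one JSON object per line (the JSON Lines format), and it only works with orient='records':

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 22]})

# lines=True requires orient='records'; each row becomes its own line.
jsonl = df.to_json(orient='records', lines=True)
print(jsonl)
```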

 

index Orientation

In this orientation, each row of the DataFrame is presented as a JSON object, with the index of the DataFrame used as the key in the outer JSON object.
Here’s an example of the ‘index’ orientation:

print(df.to_json(orient='index'))

Output:

{"0":{"Name":"John","Age":28,"City":"New York"},"1":{"Name":"Anna","Age":24,"City":"London"},"2":{"Name":"Peter","Age":22,"City":"Bangkok"}}

In the ‘index’ format, each outer key represents an index from the DataFrame, and its corresponding value is a JSON object that contains column-value pairs representing the data in that row.
This format can be useful when the indices in your DataFrame carry significant information that needs to be maintained in the JSON structure.

For instance, if the indices represent unique identifiers for the records, the ‘index’ orientation preserves that connection.
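As a small illustration (the user IDs below are made up for this example), setting such identifiers as the index carries them into the JSON keys:

```python
import pandas as pd

# Hypothetical user IDs used as the DataFrame index.
df = pd.DataFrame({'Age': [28, 24], 'City': ['New York', 'London']},
                  index=['u01', 'u02'])

json_str = df.to_json(orient='index')
print(json_str)
```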

 

columns Orientation

In this orientation, the JSON output is organized by column, with each column of the DataFrame becoming a key-value pair in the JSON object.
Let’s look at how the ‘columns’ orientation looks:

print(df.to_json(orient='columns'))

Output:

{"Name":{"0":"John","1":"Anna","2":"Peter"},"Age":{"0":28,"1":24,"2":22},"City":{"0":"New York","1":"London","2":"Bangkok"}}

In the ‘columns’ format, each key in the JSON object represents a column from the DataFrame, and the value is another JSON object. This inner object’s keys are the DataFrame’s indices, and the values are the corresponding cell values.
The ‘columns’ orientation can be especially useful when you want to work with data column by column.

 

values Orientation

In this orientation, the output is a JSON array containing the values of the DataFrame.

Here’s an example:

print(df.to_json(orient='values'))

Output:

[["John",28,"New York"],["Anna",24,"London"],["Peter",22,"Bangkok"]]

As you can see, the output is a JSON array of arrays, where each sub-array represents a row of the DataFrame.

 

table Orientation

The ‘table’ orientation is a unique option that not only includes the data but also some meta-information about the DataFrame.

It includes details such as each field’s name and type, the primary key, and the version of Pandas used.
Here’s how to use the ‘table’ orientation:

print(df.to_json(orient='table'))

Output:

{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"Name","type":"string"},{"name":"Age","type":"integer"},{"name":"City","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"Name":"John","Age":28,"City":"New York"},{"index":1,"Name":"Anna","Age":24,"City":"London"},{"index":2,"Name":"Peter","Age":22,"City":"Bangkok"}]}

In the ‘table’ format, the output JSON contains two main parts:

  1. “schema”: Describes each field’s name and type (including the index), the primary key, and the Pandas version used.
  2. “data”: A list of row objects, similar to the ‘records’ orientation but with the index included.

The ‘table’ format is useful when you want to keep track of the data types in your DataFrame, especially when sharing data between different systems or users where this meta-information can help maintain consistency and understanding.
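One practical consequence is that read_json can use the embedded schema to restore column dtypes; a brief sketch:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
table_json = df.to_json(orient='table')

# The schema section lets read_json rebuild both values and dtypes.
restored = pd.read_json(StringIO(table_json), orient='table')
print(restored.dtypes)
```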

 

Floating Point Numbers Precision

By default, Pandas uses a precision of up to 10 decimal places for floating-point numbers (double_precision=10).
You can control the precision of floating-point numbers using the double_precision parameter of the to_json function.
Let’s create a DataFrame with floating point numbers and convert it to JSON with different precision levels:

import pandas as pd
data = {'A': [1.123456789], 'B': [2.123456789], 'C': [3.123456789]}
df = pd.DataFrame(data)
print(df.to_json(orient='split', double_precision=2))

Output:

{"columns":["A","B","C"],"index":[0],"data":[[1.12,2.12,3.12]]}

In the above example, the double_precision parameter is set to 2. This will limit the precision of the floating-point numbers in the output JSON to 2 decimal places.

Controlling the precision is useful when you want to reduce the file size or make the output more readable.

 

Handling Date Formats

You can use the date_format parameter of the to_json function to format datetime data in one of two ways: ‘epoch’ (Unix timestamps) or ‘iso’ (ISO 8601 strings).

Let’s create a DataFrame with datetime data and convert it to JSON using both options:

import pandas as pd
from datetime import datetime
data = {'Date': [datetime(2022, 1, 1), datetime(2022, 2, 1), datetime(2022, 3, 1)]}
df = pd.DataFrame(data)

# Using the 'epoch' format
print(df.to_json(date_format='epoch'))

# Using the 'iso' format
print(df.to_json(date_format='iso'))

Output:

{"Date":{"0":1640995200000,"1":1643673600000,"2":1645910400000}}
{"Date":{"0":"2022-01-01T00:00:00.000Z","1":"2022-02-01T00:00:00.000Z","2":"2022-03-01T00:00:00.000Z"}}

As you can see, when using ‘epoch’, the datetime objects are converted to Unix timestamps in milliseconds (the default date_unit).

However, with ‘iso’, the datetime objects are converted to a more readable ISO format.
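Related to this, the date_unit parameter controls the timestamp precision; for example, ‘s’ emits whole seconds instead of the default milliseconds:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2022-01-01'])})

# 's' truncates epoch timestamps to whole seconds.
json_str = df.to_json(date_format='epoch', date_unit='s')
print(json_str)  # {"Date":{"0":1640995200}}
```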

 

Further Reading

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html
