Convert Python Pandas DataFrame to JSON using to_json
The to_json function is particularly useful for converting Pandas DataFrames to JSON. It serializes the DataFrame and either returns the result as a string or writes it directly to a file.
The function has several parameters that let you customize the JSON (JavaScript Object Notation) output. One of these is the orient parameter, which controls the format of the resulting JSON.
Syntax and Parameters of to_json
Here is the syntax for the to_json function:
DataFrame.to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=True)
Let’s look at each of these parameters in detail:
- path_or_buf – A string path, path object, or file-like object. If not specified, the result is returned as a string.
- orient – The format of the resulting JSON. Options are ‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, and ‘table’. A ValueError is raised if an invalid value is passed.
- date_format – Controls the format of datetime objects: ‘epoch’ encodes them as epoch milliseconds, while ‘iso’ produces ISO 8601 strings. The default is ‘epoch’ for most orientations and ‘iso’ for orient=‘table’.
- double_precision – The number of decimal places to use when encoding floating point values. The default is 10.
- force_ascii – Force encoded string to be ASCII.
- date_unit – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’ or ‘ns’.
- default_handler – Handler to call if an object cannot otherwise be converted to a suitable format for JSON. It should take the unconvertible object as its single argument and return a serializable object.
- lines – If set to True, the JSON output will be in line-delimited JSON format. This is only valid when orient is ‘records’.
- compression – The type of compression to use. If not specified, compression is inferred from the filename extension.
- index – Whether to include the DataFrame’s indices in the JSON output.
In the following sections, we’ll look at some examples of using to_json with different settings for these parameters.
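For instance, leaving path_or_buf unset makes to_json return the JSON as a plain string rather than writing a file (a quick sketch with a small throwaway DataFrame):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# With path_or_buf omitted, to_json returns the JSON as a string
json_str = df.to_json()
print(json_str)   # {"A":{"0":1,"1":2},"B":{"0":3,"1":4}}
print(type(json_str))
```

This string form is handy when you want to send the JSON over a network or embed it in a larger payload instead of writing it to disk.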
Convert Pandas DataFrame to JSON file
Now, let’s look at an example of using to_json to convert a DataFrame to a JSON file.
First, we’ll need to create a DataFrame. For this tutorial, let’s use some sample data.
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'City': ['New York', 'London', 'Bangkok']}
df = pd.DataFrame(data)
print(df)
The output of this code will be:
    Name  Age      City
0   John   28  New York
1   Anna   24    London
2  Peter   22   Bangkok
We have a DataFrame df with three columns: Name, Age, and City. Now let’s convert this DataFrame to a JSON file using the to_json function.
df.to_json('data.json')
This will write the DataFrame to a JSON file called ‘data.json’. The resulting JSON will look like this:
{
  "Name": {"0": "John", "1": "Anna", "2": "Peter"},
  "Age": {"0": 28, "1": 24, "2": 22},
  "City": {"0": "New York", "1": "London", "2": "Bangkok"}
}
In this output, each column name from the DataFrame becomes a key in the JSON file. The values under each key form another dictionary where the keys are the indices from the DataFrame and the values are the corresponding cell values.
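To sketch the full round trip, you can load the file back with read_json; since both functions default to the ‘columns’ orientation, the DataFrame survives intact:

```python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 22],
        'City': ['New York', 'London', 'Bangkok']}
df = pd.DataFrame(data)

# Write the DataFrame to a JSON file, then read it back in
df.to_json('data.json')
df2 = pd.read_json('data.json')

print(df2.equals(df))  # True
```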
Export to Compressed JSON
The compression parameter allows you to export your DataFrame to a compressed JSON file directly.
Here’s how you can do it:
df.to_json('compressed_data.json.gz', compression='gzip')
The above line of code writes the DataFrame to a gzipped JSON file called ‘compressed_data.json.gz’. Note that when the filename ends with ‘.gz’, Pandas infers that the data should be compressed using gzip, even if the compression argument isn’t explicitly set to ‘gzip’.
The compression parameter supports several types of compression, including ‘gzip’, ‘bz2’, ‘zip’, and ‘xz’. This parameter is useful when you’re working with large datasets that can be more efficiently stored in a compressed format.
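As a sketch of the round trip, read_json infers the compression from the file extension in the same way (the filename here is just an example):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# The .gz suffix triggers gzip compression on write...
df.to_json('compressed_data.json.gz')

# ...and read_json infers gzip from the same suffix on read
df2 = pd.read_json('compressed_data.json.gz')
print(df2.equals(df))  # True
```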
Exploring Different Orientation Options
The orient parameter in the to_json function allows you to customize the format of the resulting JSON. Let’s explore the different options that it offers.
The possible options for orient are ‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, and ‘table’. The DataFrame index appears in the output for every option except ‘records’ and ‘values’.
Remember to use the same orientation when reading the JSON file using read_json.
split Orientation
The ‘split’ orientation is a special one. In this format, the data is split into different sections for the column headers, row indices, and data. This can be particularly useful if you need to reconstruct the DataFrame later as it clearly separates the key components of the DataFrame.
Let’s take a closer look at how the ‘split’ orientation works.
print(df.to_json(orient='split'))
Output:
{"columns":["Name","Age","City"],"index":[0,1,2],"data":[["John",28,"New York"],["Anna",24,"London"],["Peter",22,"Bangkok"]]}
In the ‘split’ format, the JSON output consists of three main parts:
- “columns”: This is a list of the DataFrame’s column names.
- “index”: This is a list of the DataFrame’s row indices.
- “data”: This is a list of lists, where each sub-list represents a row of data from the DataFrame.
This structure ensures that all the necessary information to reconstruct the original DataFrame is preserved.
The ‘split’ orientation is often used when storing DataFrame data for later usage due to this feature.
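A minimal round-trip sketch: serialize with orient='split' and pass the same orientation back to read_json (wrapping the string in StringIO, which newer Pandas versions expect for literal JSON input):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 22]})

# Serialize with orient='split'...
json_str = df.to_json(orient='split')

# ...and reconstruct using the same orientation
df2 = pd.read_json(StringIO(json_str), orient='split')
print(df2.equals(df))  # True
```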
records Orientation
In this case, each row of the DataFrame is output as a separate JSON object within a list.
Here’s how you can use the ‘records’ orientation:
print(df.to_json(orient='records'))
Output:
[{"Name":"John","Age":28,"City":"New York"},{"Name":"Anna","Age":24,"City":"London"},{"Name":"Peter","Age":22,"City":"Bangkok"}]
In the ‘records’ format, the output JSON is a list of objects. Each object represents a row from the DataFrame, where each key-value pair corresponds to a column and its value in that row.
One advantage of the ‘records’ format is its readability.
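The ‘records’ orientation is also the only one that works with the lines=True parameter, which emits one JSON object per line (the JSON Lines format, popular for streaming and log-style data):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# lines=True is only valid with orient='records'
print(df.to_json(orient='records', lines=True))
# {"Name":"John","Age":28}
# {"Name":"Anna","Age":24}
```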
index Orientation
In this orientation, each row of the DataFrame is presented as a JSON object, with the index of the DataFrame used as the key in the outer JSON object.
Here’s an example of the ‘index’ orientation:
print(df.to_json(orient='index'))
Output:
{"0":{"Name":"John","Age":28,"City":"New York"},"1":{"Name":"Anna","Age":24,"City":"London"},"2":{"Name":"Peter","Age":22,"City":"Bangkok"}}
In the ‘index’ format, each outer key represents an index from the DataFrame, and its corresponding value is a JSON object that contains column-value pairs representing the data in that row.
This format can be useful when the indices in your DataFrame carry significant information that needs to be maintained in the JSON structure.
For instance, if the indices represent unique identifiers for the records, the ‘index’ orientation preserves that connection.
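For example, with a hypothetical set of user IDs as the index, each ID becomes the outer key of its record:

```python
import pandas as pd

# 'u100' and 'u200' are made-up user IDs serving as the index
df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]},
                  index=['u100', 'u200'])

print(df.to_json(orient='index'))
# {"u100":{"Name":"John","Age":28},"u200":{"Name":"Anna","Age":24}}
```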
columns Orientation
In this orientation, the JSON output is organized by column, with each column of the DataFrame becoming a key-value pair in the JSON object.
Let’s look at how the ‘columns’ orientation looks:
print(df.to_json(orient='columns'))
Output:
{"Name":{"0":"John","1":"Anna","2":"Peter"},"Age":{"0":28,"1":24,"2":22},"City":{"0":"New York","1":"London","2":"Bangkok"}}
In the ‘columns’ format, each key in the JSON object represents a column from the DataFrame, and the value is another JSON object. This inner object’s keys are the DataFrame’s indices, and the values are the corresponding cell values.
The ‘columns’ orientation can be especially useful when you want to work with data column by column.
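A small sketch of that column-wise access: parse the ‘columns’ output with the standard json module and pick out a single column:

```python
import json
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

parsed = json.loads(df.to_json(orient='columns'))

# Each top-level key is a column; its value maps index -> cell value
print(parsed['Age'])  # {'0': 28, '1': 24}
```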
values Orientation
In this orientation, the output is a JSON array containing the values of the DataFrame.
Here’s an example:
print(df.to_json(orient='values'))
Output:
[["John",28,"New York"],["Anna",24,"London"],["Peter",22,"Bangkok"]]
As you can see, the output is a JSON array of arrays, where each sub-array represents a row of the DataFrame.
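A quick sketch of consuming this output: parsing it with the standard json module yields a plain list of row lists, with no column names or index attached:

```python
import json
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

rows = json.loads(df.to_json(orient='values'))
print(rows)  # [['John', 28], ['Anna', 24]]
```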
table Orientation
The ‘table’ orientation is a unique option that not only includes the data but also some meta-information about the DataFrame.
It includes details about the field name, type, and version of Pandas used.
Here’s how to use the ‘table’ orientation:
print(df.to_json(orient='table'))
Output:
{"schema": {"fields":[{"name":"index","type":"integer"},{"name":"Name","type":"string"},{"name":"Age","type":"integer"},{"name":"City","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"}, "data": [{"index":0,"Name":"John","Age":28,"City":"New York"},{"index":1,"Name":"Anna","Age":24,"City":"London"},{"index":2,"Name":"Peter","Age":22,"City":"Bangkok"}]}
In the ‘table’ format, the output JSON contains two main parts:
- “schema”: Describes each field’s name and type, the primary key, and the Pandas version used.
- “data”: A list of objects, one per row, keyed by column name, with the index included.
The ‘table’ format is useful when you want to keep track of the data types in your DataFrame, especially when sharing data between different systems or users where this meta-information can help maintain consistency and understanding.
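A round-trip sketch: because the schema records each column’s type, read_json with orient='table' can restore the original dtypes:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

json_str = df.to_json(orient='table')

# The embedded schema lets read_json restore the original dtypes
df2 = pd.read_json(StringIO(json_str), orient='table')
print(df2.dtypes)  # Age comes back as int64, not a generic object
```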
Floating Point Numbers Precision
By default, Pandas uses a precision of up to 10 decimal places for floating-point numbers.
You can control the precision of floating-point numbers using the double_precision parameter of the to_json function.
Let’s create a DataFrame with floating point numbers and convert it to JSON with different precision levels:
import pandas as pd

data = {'A': [1.123456789], 'B': [2.123456789], 'C': [3.123456789]}
df = pd.DataFrame(data)
print(df.to_json(orient='split', double_precision=2))
Output:
{"columns":["A","B","C"],"index":[0],"data":[[1.12,2.12,3.12]]}
In the above example, the double_precision parameter is set to 2. This limits the precision of the floating-point numbers in the output JSON to 2 decimal places.
Controlling the precision is useful when you want to reduce the file size or make the output more readable.
Handling Date Formats
You can use the date_format parameter of the to_json function to encode datetime data in one of two formats: ‘epoch’ or ‘iso’.
Let’s create a DataFrame with datetime data and convert it to JSON using both options:
import pandas as pd
from datetime import datetime

data = {'Date': [datetime(2022, 1, 1), datetime(2022, 2, 1), datetime(2022, 3, 1)]}
df = pd.DataFrame(data)

# Using the 'epoch' format
print(df.to_json(date_format='epoch'))

# Using the 'iso' format
print(df.to_json(date_format='iso'))
Output:
{"Date":{"0":1640995200000,"1":1643673600000,"2":1645910400000}}
{"Date":{"0":"2022-01-01T00:00:00.000Z","1":"2022-02-01T00:00:00.000Z","2":"2022-03-01T00:00:00.000Z"}}
As you can see, when using ‘epoch’ the datetime objects are converted to Unix timestamps (in milliseconds by default), while ‘iso’ converts them to a more readable ISO 8601 format.
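The related date_unit parameter controls the precision of the ‘epoch’ output; for example, switching from the default milliseconds to seconds (a quick sketch):

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2022, 1, 1)]})

# date_unit='s' encodes epoch timestamps in seconds instead of milliseconds
print(df.to_json(date_format='epoch', date_unit='s'))
# {"Date":{"0":1640995200}}
```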
Further Reading
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.