Read JSON files using Python Pandas read_json
Pandas read_json function allows you to read JSON files and convert them to a Pandas DataFrame.
In this tutorial, you’ll learn how to load JSON files, handle various JSON formats, customize column labels, infer dtypes, parse dates, and more using Python’s Pandas read_json function.
- 1 JSON structure and syntax
- 2 Read JSON file
- 3 Read JSON from a URL or API
- 4 Read compressed JSON
- 5 Read nested JSON
- 6 Customizing Column Names
- 7 Understanding orientations
- 8 Data Types
- 9 Handle Dates using convert_dates
- 10 Choose the proper Encoding
- 11 Reading large JSON files
- 12 Limit rows using nrows
- 13 At The Speed of Light
JSON structure and syntax
Before we dive into Pandas’ read_json() function, let’s briefly review the structure and syntax of JSON.
[ { "name": "John", "age": 44, "city": "New York" }, { "name": "Alice", "age": 25, "city": "San Francisco" } ]
A JSON document consists of key-value pairs and supports various data types such as strings, numbers, booleans, arrays, and objects.
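If you want to see how JSON maps to Python objects before involving Pandas, the standard json module makes the correspondence clear. A minimal sketch using an inline string rather than a file:
import json

# JSON arrays become Python lists, objects become dicts,
# numbers become int/float, and booleans become True/False.
raw = '[{"name": "John", "age": 44, "active": true}]'
records = json.loads(raw)
print(type(records))      # <class 'list'>
print(type(records[0]))   # <class 'dict'>
print(records[0]["age"])  # 44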
Read JSON file
To start reading a JSON file, you can simply provide the file path to the read_json() function.
Let’s consider an example where we have a JSON file called “data.json” with the following content:
[ { "name": "John", "age": 44, "city": "New York" }, { "name": "Alice", "age": 25, "city": "San Francisco" } ]
Now, let’s load this JSON file into a DataFrame using the read_json() function:
import pandas as pd

df = pd.read_json('data.json')
print(df)
Output:
    name  age           city
0   John   44       New York
1  Alice   25  San Francisco
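read_json() also accepts file-like objects. Recent Pandas versions deprecate passing a literal JSON string directly, so if your data is already in a Python string, wrap it in StringIO. A quick sketch:
import pandas as pd
from io import StringIO

# Wrap the JSON text in StringIO; recent Pandas versions
# warn when you pass a raw JSON string directly.
json_str = '[{"name": "John", "age": 44}, {"name": "Alice", "age": 25}]'
df = pd.read_json(StringIO(json_str))
print(df)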
Read JSON from a URL or API
Let’s consider an example where we have a URL that returns the following JSON data:
{ "name": "John", "age": 44, "city": "New York" }
To read this JSON data from the URL, you can use the read_json() function as follows:
import pandas as pd

url = 'https://example.com/data.json'
df = pd.read_json(url)
print(df)
Output:
   name  age      city
0  John   44  New York
In the code above, we specify the URL of the JSON data and pass it to the read_json() function.
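If the API requires headers or authentication, read_json() cannot send them for you. In that case you can fetch the response yourself, for example with the requests library, and hand the body to read_json(). A sketch with a placeholder URL and token:
import pandas as pd
import requests
from io import StringIO

# Placeholder endpoint and token; replace with your real values.
url = 'https://example.com/data.json'
headers = {'Authorization': 'Bearer <your-token>'}

response = requests.get(url, headers=headers)
response.raise_for_status()  # fail early on HTTP errors

# Pass the response body to read_json via StringIO.
df = pd.read_json(StringIO(response.text))
print(df)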
Read compressed JSON
Sometimes, JSON data is compressed to reduce the file size or optimize data transfer.
Pandas’ read_json() function supports reading compressed JSON files directly, without the need to decompress them manually.
Let’s say we have a compressed JSON file called “data.json.gz” with the following content:
[ { "name": "John", "age": 44, "city": "New York" }, { "name": "Alice", "age": 25, "city": "San Francisco" } ]
To read this compressed JSON file, you can use the read_json() function with the compression parameter set to the appropriate compression method. For gzip-compressed files, we set compression='gzip':
import pandas as pd

df = pd.read_json('data.json.gz', compression='gzip')
print(df)
Output:
    name  age           city
0   John   44       New York
1  Alice   25  San Francisco
In the above code, we pass the file path of the compressed JSON file, ‘data.json.gz’, to the read_json() function. Then, we specify compression='gzip' to indicate that the file is gzip-compressed. The function automatically handles the decompression and loads the JSON data into a DataFrame.
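In practice you can often omit the compression argument: its default is 'infer', which picks the method from the file extension (.gz, .bz2, .zip, .xz, .zst). A small sketch that writes a gzip file and reads it back without naming the compression:
import gzip
import pandas as pd

# Write a small gzip-compressed JSON file for the demo.
with gzip.open('data.json.gz', 'wt', encoding='utf-8') as f:
    f.write('[{"name": "John", "age": 44}, {"name": "Alice", "age": 25}]')

# compression defaults to 'infer', so the .gz extension is enough.
df = pd.read_json('data.json.gz')
print(df)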
Read nested JSON
Let’s consider an example where we have the following nested JSON data:
{ "name": "John", "age": 44, "address": { "city": "New York", "country": "USA" } }
To flatten this nested JSON data, we can load it with the json module and pass the result to the json_normalize() function as follows:
import json
import pandas as pd

with open('nested_sample.json') as f:
    data = json.load(f)

df = pd.json_normalize(data)
print(df)
Output:
   name  age address.city address.country
0  John   44     New York             USA
In this example, we load the data using json.load, then use json_normalize to flatten the nested JSON.
Here is another example of nested JSON, where a key holds a list of objects:
{ "employees": [ { "firstName": "John", "lastName": "Doe" }, { "firstName": "Adam", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }
You can handle this structure by using the json_normalize() function from Pandas, like this:
import json
import pandas as pd

with open('sample.json') as f:
    data = json.load(f)

df = pd.json_normalize(data, 'employees')
print(df)
Output:
  firstName lastName
0      John      Doe
1      Adam    Smith
2     Peter    Jones
The json_normalize() function is used to flatten the data structure and transform the nested dictionaries into a DataFrame with separate columns for ‘firstName’ and ‘lastName’.
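json_normalize() can also carry top-level keys alongside the flattened records through its meta parameter. A small sketch, assuming a hypothetical structure that adds a 'company' key next to 'employees':
import pandas as pd

# Hypothetical structure: a top-level 'company' key plus the records.
data = {
    "company": "Acme",
    "employees": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Adam", "lastName": "Smith"},
    ],
}

# record_path selects the list to flatten; meta repeats 'company'
# on every resulting row.
df = pd.json_normalize(data, record_path='employees', meta=['company'])
print(df)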
Customizing Column Names
By default, Pandas’ read_json() function uses the keys in the JSON objects as column names in the resulting DataFrame.
However, you can customize the column names by renaming the columns of the DataFrame.
Let’s consider an example where we have the following JSON data:
{ "firstname": "John", "lastname": "Doe", "age": 44 }
The code will look like this:
import pandas as pd

df = pd.read_json('sample.json')
df.columns = ['First Name', 'Last Name', 'Age']
print(df)
Output:
  First Name Last Name  Age
0       John       Doe   44
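Assigning to df.columns replaces all names by position, which breaks silently if the column order ever changes. A more targeted alternative is df.rename(), which maps old names to new ones explicitly:
import pandas as pd

df = pd.read_json('sample.json')

# Rename by name rather than by position; unmatched keys are ignored.
df = df.rename(columns={
    'firstname': 'First Name',
    'lastname': 'Last Name',
    'age': 'Age',
})
print(df)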
Understanding orientations
The read_json() function in Pandas provides various orientations for interpreting different JSON data structures. The orient parameter allows you to specify the desired orientation. Let’s explore the available orientations and their use cases:
split
The 'split' orientation expects the JSON data as a dictionary with three keys: 'columns' holding the column names, 'index' holding the row labels, and 'data' holding the rows as a list of lists.
Consider the following JSON data:
{ "name": ["John", "Alice"], "age": [44, 25], "city": ["New York", "San Francisco"] }
To read this JSON data using the 'split' orientation, we can use the read_json() function as follows:
import pandas as pd

df = pd.read_json('data.json', orient='split')
print(df)
Output:
    name  age           city
0   John   44       New York
1  Alice   25  San Francisco
records
The 'records' orientation interprets each element in the JSON data as a separate record, represented by a dictionary.
Consider the following JSON data in a file:
[ {"name": "John", "age": 44}, {"name": "Alice", "age": 25} ]
To read this JSON data using the 'records' orientation, we can use the read_json() function as follows:
import pandas as pd

df = pd.read_json('data.json', orient='records')
print(df)
Output:
    name  age
0   John   44
1  Alice   25
index
The 'index' orientation treats the JSON data as a dictionary of dictionaries, where the outer dictionary keys correspond to row labels, and the inner dictionary keys correspond to column names.
Consider the following JSON data:
{ "row1": {"name": "John", "age": 44}, "row2": {"name": "Alice", "age": 25} }
To read this JSON data using the 'index' orientation, we can use the read_json() function as follows:
import pandas as pd

df = pd.read_json('data.json', orient='index')
print(df)
Output:
       name  age
row1   John   44
row2  Alice   25
columns
The 'columns' orientation treats the JSON data as a dictionary of dictionaries, where the outer dictionary keys correspond to column names this time, and the inner dictionary keys correspond to row labels.
Consider the following JSON data:
{ "name": {"row1": "John", "row2": "Alice"}, "age": {"row1": 44, "row2": 25} }
To read this JSON data using the 'columns' orientation, we can use the read_json() function as follows:
import pandas as pd

df = pd.read_json('data.json', orient='columns')
print(df)
Output:
       name  age
row1   John   44
row2  Alice   25
values
The 'values' orientation treats the JSON data as a list of lists, where each inner list represents a row in the DataFrame.
Consider the following JSON data:
[ ["John", 44], ["Alice", 25] ]
To read this JSON data using the 'values' orientation, we can use the read_json() function as follows:
import pandas as pd

df = pd.read_json('data.json', orient='values')
print(df)
Output:
       0   1
0   John  44
1  Alice  25

Since the JSON contains no column names, Pandas assigns the default integer labels 0 and 1; you can rename them afterwards, for example with df.columns = ['name', 'age'].
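An easy way to see what each orientation looks like is to go the other direction: DataFrame.to_json() accepts the same orient values, so you can serialize a small frame with each one and inspect the result:
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Alice'], 'age': [44, 25]})

# Print the same frame in every orientation read_json understands.
for orient in ['split', 'records', 'index', 'columns', 'values']:
    print(orient, '->', df.to_json(orient=orient))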
Data Types
By default, Pandas’ read_json() function infers the data types of the columns based on the JSON data. However, you can control the data types by using the dtype parameter.
Let’s consider an example where we have the following JSON data:
[ { "name": "John", "age": "44", "city": "New York" }, { "name": "Alice", "age": "25", "city": "San Francisco" } ]
To read this JSON data and specify the data types for the columns, we can use the read_json() function as follows:
import pandas as pd

newtypes = {"name": str, "age": int, "city": str}
df = pd.read_json('data.json', dtype=newtypes)
print(df.dtypes)
Output:
name    object
age      int32
city    object
dtype: object
We provide a dictionary, newtypes, which specifies the desired data type for each column. Note that plain int maps to a platform-dependent width (int32 on Windows, int64 on most other platforms); pass 'int64' explicitly if you need a fixed width.
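If you prefer to load first and convert afterwards, you can reach a similar result with pd.to_numeric(), optionally finishing with Pandas’ nullable 'Int64' dtype so the column tolerates missing values. A sketch, assuming the age column arrives as text:
import pandas as pd

df = pd.read_json('data.json')

# Parse the string ages into numbers, then move to the nullable
# integer dtype ('Int64' with a capital I), which allows gaps.
df['age'] = pd.to_numeric(df['age']).astype('Int64')
print(df.dtypes)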
Handle Dates using convert_dates
The convert_dates parameter allows you to convert date strings to Pandas datetime objects.
Let’s consider an example where we have the following JSON data:
[ { "name": "John", "birthdate": "1990-05-20", "last_login": "2022-01-01 10:30:00" }, { "name": "Alice", "birthdate": "1995-02-15", "last_login": "2022-02-05 14:45:00" } ]
To convert the date and datetime strings into Pandas datetime objects, we can use the read_json() function as follows:
import pandas as pd

df = pd.read_json('data.json', convert_dates=['birthdate', 'last_login'])
print(df.dtypes)
Output:
name                  object
birthdate     datetime64[ns]
last_login    datetime64[ns]
dtype: object
In the code above, the file contains a JSON array of objects. We specify convert_dates=['birthdate', 'last_login'] to indicate the columns that should be converted into Pandas datetime objects.
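If you want the strings left untouched instead, pass convert_dates=False and convert later yourself, which also gives you explicit format control. A minimal sketch based on the same file:
import pandas as pd

# Leave every date string as plain text.
df_raw = pd.read_json('data.json', convert_dates=False)
print(df_raw.dtypes)  # birthdate and last_login remain object columns

# Convert afterwards with an explicit format.
df_raw['birthdate'] = pd.to_datetime(df_raw['birthdate'], format='%Y-%m-%d')
print(df_raw.dtypes)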
Choose the proper Encoding
If the JSON files you are trying to read use an encoding other than UTF-8, you should set the encoding parameter to the proper value to read the data correctly.
import pandas as pd

try:
    df = pd.read_json('arabic_data.json')
    print(df.head())
except Exception as e:
    print(f"An error occurred: {e}")
If the above code raises an error, it means you need to determine the correct encoding for your files.
You can use the chardet module to detect the file encoding like this:
import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

file_path = 'arabic_data.json'
encoding = detect_encoding(file_path)
print(f'The detected encoding of the file is: {encoding}')
Please note that chardet can return None or an incorrect encoding when it is unable to detect the encoding reliably. In that case, you should contact the owner of the files.
After getting the file encoding, you can read the file with the correct encoding like this:
import pandas as pd

df = pd.read_json('arabic_data.json', encoding='Windows-1256')
print(df.head())
Reading large JSON files
When dealing with large JSON files, setting the lines parameter to True can be beneficial. It tells Pandas the file is in JSON Lines format (one JSON object per line), which, combined with the chunksize parameter, lets you process large datasets that don’t fit into memory.
Let’s consider an example where we have a large JSON file with each line representing a separate JSON object:
{"name": "John", "age": 44, "city": "New York"} {"name": "Jane", "age": 32, "city": "Chicago"} {"name": "Tom", "age": 35, "city": "San Francisco"} {"name": "Alice", "age": 25, "city": "Boston"} {"name": "Bob", "age": 27, "city": "Los Angeles"} {"name": "Charlie", "age": 34, "city": "Seattle"} {"name": "Dave", "age": 31, "city": "Austin"} {"name": "Eve", "age": 33, "city": "Denver"} {"name": "Frank", "age": 29, "city": "Dallas"} {"name": "Grace", "age": 28, "city": "Miami"} ...
To read this large JSON file line by line, we can use the read_json() function with lines=True and set chunksize=1 to read one record at a time. In real-world applications, the chunksize can be 500 or so.
import pandas as pd

start = 2
end = 5
total = 0

json_reader = pd.read_json('sample.json', lines=True, chunksize=1)
result = pd.DataFrame()
for chunk in json_reader:
    if total >= start and total < end:
        result = pd.concat([result, chunk])
    total += 1
    if total == end:
        break
print(result)
Output:
    name  age           city
2    Tom   35  San Francisco
3  Alice   25         Boston
4    Bob   27    Los Angeles
In the above code, we read the records from index 2 up to, but not including, index 5; on each iteration, read_json yields one record according to the chunksize we set. Then we concatenate the selected chunks into a single DataFrame. You can adjust the chunk size based on the size of your data and your available memory.
Note: This will work only if each line in your JSON file represents a valid JSON object.
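Chunked reading pays off most when you reduce each chunk as it arrives instead of keeping everything. A sketch that computes the average age across the whole file while holding only one chunk in memory at a time, with the chunk size raised to a more realistic value:
import pandas as pd

total_age = 0
total_rows = 0

# Stream the JSON Lines file in chunks of 500 records and
# keep only running totals, never the whole dataset.
reader = pd.read_json('sample.json', lines=True, chunksize=500)
for chunk in reader:
    total_age += chunk['age'].sum()
    total_rows += len(chunk)

print('average age:', total_age / total_rows)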
Limit rows using nrows
When dealing with large JSON files, you may not always need to load the entire dataset into memory. The nrows parameter allows you to specify the number of rows to read from the JSON file.
Let’s say we have a large JSON file with thousands of rows, but we only want to read the first 3 rows. We can achieve this by setting nrows=3:
import pandas as pd

df = pd.read_json('data.json', lines=True, nrows=3)
print(df)
Output:
   name  age           city
0  John   44       New York
1  Jane   32        Chicago
2   Tom   35  San Francisco
In the code snippet above, we read the JSON file “data.json” and specify nrows=3 to read only the first 3 rows. The resulting DataFrame, df, contains only those rows.
Note: To use the nrows parameter, you must also set the lines parameter to True.
At The Speed of Light
I was hired by a large financial services company that had been suffering from an inefficiency problem related to their data processing application.
The application was designed to analyze a constant influx of transaction data from several different banking partners, which came in the form of large JSON files.
The files contained data such as transaction times, amounts, categories, and merchant details. The data was crucial for the company’s risk analysis and fraud detection modules.
The issue was that the application, which used an older version of Pandas, was having difficulty handling the sheer volume of data, especially at peak times.
It was using the read_json function to load the entire dataset into a DataFrame in one go.
At peak times, the size of the data could reach up to 10 GB. This was causing a substantial delay, leading to slower risk assessment and, consequently, potential revenue loss.
After analyzing the situation, I realized that the application was running on Pandas version 0.20.3, which was not the most efficient for their use case.
The first step I took was to upgrade Pandas to the latest version available at the time, which was 1.3.0; the real pain there was upgrading the other modules so that everything worked with the newer version. But the significant performance boost came when I started implementing chunking with read_json.
By modifying the code to load the data in chunks, we were able to drastically reduce the load time: an 83% decrease!