Read JSON files using Python Pandas read_json

The Pandas read_json() function allows you to read JSON files and convert them into a Pandas DataFrame.

In this tutorial, you’ll learn how to load JSON files, handle various JSON formats, customize column labels, infer dtypes, parse dates, and more using Python’s Pandas read_json function.


JSON structure and syntax

Before we dive into Pandas’ read_json() function, let’s briefly review the structure and syntax of JSON. Here is a simple JSON document containing a list of two objects:

[
  {
    "name": "John",
    "age": 44,
    "city": "New York"
  },
  {
    "name": "Alice",
    "age": 25,
    "city": "San Francisco"
  }
]

JSON data consists of key-value pairs and supports various data types such as strings, numbers, booleans, arrays, and nested objects.
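In Python, this structure maps directly onto built-in types: a JSON array becomes a list and a JSON object becomes a dictionary. Here is a minimal sketch using the standard json module to parse the sample above:

import json

# The sample document above, as a string
raw = '''
[
  {"name": "John", "age": 44, "city": "New York"},
  {"name": "Alice", "age": 25, "city": "San Francisco"}
]
'''

records = json.loads(raw)    # a list of dictionaries
print(type(records))         # <class 'list'>
print(records[0]["city"])    # New York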


Read JSON file

To start reading a JSON file, you can simply provide the file path to the read_json() function.

Let’s consider an example where we have a JSON file called “data.json” with the following content:

[
  {
    "name": "John",
    "age": 44,
    "city": "New York"
  },
  {
    "name": "Alice",
    "age": 25,
    "city": "San Francisco"
  }
]

Now, let’s load this JSON file into a DataFrame using the read_json function:

import pandas as pd
df = pd.read_json('data.json')
print(df)

Output:

    name  age           city
0   John   44       New York
1  Alice   25  San Francisco


Read JSON from a URL or API

Let’s consider an example where we have a URL that returns the following JSON data:

[
  {
    "name": "John",
    "age": 44,
    "city": "New York"
  }
]

To read this JSON data from the URL, you can use the read_json() function as follows:

import pandas as pd
url = 'https://example.com/data.json'
df = pd.read_json(url)
print(df)

Output:

   name  age       city
0  John   44  New York

In the code above, we specify the URL of the JSON data and pass it to the read_json() function.
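If the endpoint requires request headers or authentication, read_json() alone won’t help, since it simply fetches the URL as-is. A common pattern in that case (sketched below; the URL and token are placeholders, not a real API) is to fetch the payload with the requests library and build the DataFrame from the parsed response:

import pandas as pd
import requests

# Hypothetical endpoint and token, for illustration only
url = 'https://example.com/api/users'
headers = {'Authorization': 'Bearer <your-token>'}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on HTTP errors

# Assumes the API returns a JSON array of records
df = pd.DataFrame(response.json())
print(df.head())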


Read compressed JSON

Sometimes, JSON data is compressed to reduce the file size or optimize data transfer.

Pandas read_json() function supports reading compressed JSON files directly without the need to decompress them manually.
Let’s say we have a compressed JSON file called “data.json.gz” with the following content:

[
  {
    "name": "John",
    "age": 44,
    "city": "New York"
  },
  {
    "name": "Alice",
    "age": 25,
    "city": "San Francisco"
  }
]

To read this compressed JSON file, you can use the read_json() function with the compression parameter set to the appropriate compression method.

For gzip-compressed files, we set compression='gzip':

import pandas as pd
df = pd.read_json('data.json.gz', compression='gzip')
print(df)

Output:

    name  age           city
0   John   44       New York
1  Alice   25  San Francisco

In the above code, we pass the file path of the compressed JSON file, ‘data.json.gz’, to the read_json() function.

Then, we specify compression='gzip' to indicate that the file is gzip-compressed. The function automatically handles the decompression and loads the JSON data into a DataFrame.
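Because the compression parameter defaults to 'infer', pandas can usually detect the compression method from the file extension (such as .gz, .bz2, .zip, or .xz), so the explicit argument is optional here:

import pandas as pd

# compression='infer' (the default) detects gzip from the .gz extension
df = pd.read_json('data.json.gz')
print(df)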


Read nested JSON

Let’s consider an example where we have the following nested JSON data:

{
  "name": "John",
  "age": 44,
  "address": {
    "city": "New York",
    "country": "USA"
  }
}

To flatten this nested JSON data, we can load it with the built-in json module and pass the result to Pandas’ json_normalize() function:

import json
import pandas as pd

with open('nested_sample.json') as f:
    data = json.load(f)

df = pd.json_normalize(data)
print(df)

Output:

   name  age address.city address.country
0  John   44     New York             USA

In this example, we load the data using json.load() and then use json_normalize() to flatten the nested JSON.
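By default, json_normalize() joins nested keys with a dot. If you prefer a different separator in the flattened column names, you can pass the sep parameter, as in this small sketch:

import pandas as pd

data = {
    "name": "John",
    "age": 44,
    "address": {"city": "New York", "country": "USA"}
}

# Use an underscore instead of the default dot when flattening
df = pd.json_normalize(data, sep='_')
print(df.columns.tolist())   # ['name', 'age', 'address_city', 'address_country']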

Here is another example of nested JSON, where a key holds a list of records:

{
    "employees": [
        { "firstName": "John", "lastName": "Doe" },
        { "firstName": "Adam", "lastName": "Smith" },
        { "firstName": "Peter", "lastName": "Jones" }
    ]
}

You can flatten this structure by passing 'employees' as the record path to Pandas’ json_normalize function, like this:

import json
import pandas as pd

with open('sample.json') as f:
    data = json.load(f)
df = pd.json_normalize(data, 'employees')
print(df)

Output:

   firstName  lastName
0       John       Doe
1       Adam     Smith
2      Peter     Jones

The json_normalize function uses 'employees' as the record path, flattening the nested records into a DataFrame with separate columns for ‘firstName’ and ‘lastName’.
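If the file also contained top-level fields that you want repeated alongside each record, you can pull them in with the meta parameter. The 'company' key below is a hypothetical addition used only for illustration:

import pandas as pd

data = {
    "company": "Acme Corp",   # hypothetical top-level field
    "employees": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Adam", "lastName": "Smith"}
    ]
}

# record_path selects the list of records; meta repeats the top-level field
df = pd.json_normalize(data, record_path='employees', meta=['company'])
print(df)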


Customizing Column Names

By default, Pandas’ read_json() function uses the keys in the JSON objects as column names in the resulting DataFrame.

However, you can customize the column names by renaming the columns of the DataFrame.
Let’s consider an example where we have the following JSON data:

[
  {
    "firstname": "John",
    "lastname": "Doe",
    "age": 44
  }
]

The code will look like this:

import pandas as pd
df = pd.read_json('sample.json')
df.columns = ['First Name', 'Last Name', 'Age']
print(df)

Output:

   First Name  Last Name  Age
0        John        Doe   44
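Assigning df.columns relies on the column order in the file. A safer alternative is the rename() method, which maps old names to new ones explicitly:

import pandas as pd

df = pd.read_json('sample.json')

# Rename columns by name rather than by position
df = df.rename(columns={'firstname': 'First Name',
                        'lastname': 'Last Name',
                        'age': 'Age'})
print(df)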


Understanding orientations

The read_json() function in Pandas provides various orientations for interpreting different JSON data structures.

The orient parameter allows you to specify the desired orientation. Let’s explore the available orientations and their use cases:

split

The 'split' orientation expects the JSON data to be a dictionary with three keys: 'columns' (the column labels), 'index' (the row labels), and 'data' (the rows as a list of lists).
Consider the following JSON data:

{
  "columns": ["name", "age", "city"],
  "index": [0, 1],
  "data": [
    ["John", 44, "New York"],
    ["Alice", 25, "San Francisco"]
  ]
}

To read this JSON data using the 'split' orientation, we can use the read_json() function as follows:

import pandas as pd
df = pd.read_json('data.json', orient='split')
print(df)

Output:

    name  age           city
0   John   44       New York
1  Alice   25  San Francisco
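Each orientation mirrors what DataFrame.to_json() produces with the same orient value, so a quick way to see the layout an orientation expects is to serialize a small DataFrame yourself:

import pandas as pd

df = pd.DataFrame({'name': ['John', 'Alice'], 'age': [44, 25]})

# Inspect the JSON layout produced by each orientation
print(df.to_json(orient='split'))
print(df.to_json(orient='records'))
print(df.to_json(orient='index'))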


records

The 'records' orientation interprets each element in the JSON data as a separate record, represented by a dictionary.
Consider the following JSON data in a file:

[
  {"name": "John", "age": 44},
  {"name": "Alice", "age": 25}
]

To read this JSON data using the 'records' orientation, we can use the read_json() function as follows:

import pandas as pd
df = pd.read_json('data.json', orient='records')
print(df)

Output:

    name  age
0   John   44
1  Alice   25


index

The 'index' orientation treats the JSON data as a dictionary of dictionaries, where the outer dictionary keys correspond to row labels, and the inner dictionary keys correspond to column names.
Consider the following JSON data:

{
  "row1": {"name": "John", "age": 44},
  "row2": {"name": "Alice", "age": 25}
}

To read this JSON data using the 'index' orientation, we can use the read_json() function as follows:

import pandas as pd
df = pd.read_json('data.json', orient='index')
print(df)

Output:

       name  age
row1   John   44
row2  Alice   25


columns

The 'columns' orientation treats the JSON data as a dictionary of dictionaries, where the outer dictionary keys correspond to column names this time, and the inner dictionary keys correspond to row labels.
Consider the following JSON data:

{
  "name": {"row1": "John", "row2": "Alice"},
  "age": {"row1": 44, "row2": 25}
}

To read this JSON data using the 'columns' orientation, we can use the read_json() function as follows:

import pandas as pd
df = pd.read_json('data.json', orient='columns')
print(df)

Output:

       name  age
row1   John   44
row2  Alice   25


values

The 'values' orientation treats the JSON data as a list of lists, where each list represents a row in the DataFrame.
Consider the following JSON data:

[
    ["John", 44],
    ["Alice", 25]
]

To read this JSON data using the 'values' orientation, we can use the read_json() function as follows:

import pandas as pd
df = pd.read_json('data.json', orient='values')
print(df)

Output:

       0   1
0   John  44
1  Alice  25
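Because the 'values' orientation carries no column labels, the resulting columns get default integer names. You can assign meaningful labels afterward:

import pandas as pd

df = pd.read_json('data.json', orient='values')
df.columns = ['name', 'age']   # assign labels, since the JSON had none
print(df)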


Data Types

By default, Pandas’ read_json() function infers the data types of the columns based on the JSON data. However, you can control the data types by using the dtype parameter.
Let’s consider an example where we have the following JSON data:

[
  {
    "name": "John",
    "age": "44",
    "city": "New York"
  },
  {
    "name": "Alice",
    "age": "25",
    "city": "San Francisco"
  }
]

To read this JSON data and specify the data types for the columns, we can use the read_json() function as follows:

import pandas as pd
newtypes = {"name": str, "age": int, "city": str}
df = pd.read_json('data.json', dtype=newtypes)
print(df.dtypes)

Output:

name    object
age      int32
city    object
dtype: object

We provide a dictionary, newtypes, which specifies the desired data types for the columns. (Depending on your platform, the integer column may appear as int64 rather than int32.)
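The dtype parameter also accepts a boolean: the default behavior infers types, while dtype=False disables inference entirely, so the quoted ages stay as strings. A minimal sketch:

import pandas as pd

# Disable dtype inference so quoted numbers stay as strings
df = pd.read_json('data.json', dtype=False)
print(df.dtypes)   # age remains object (string) instead of an integer type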


Handle Dates using convert_dates

The convert_dates parameter allows you to convert date strings to Pandas date objects.

Let’s consider an example where we have the following JSON data:

[
  {
    "name": "John",
    "birthdate": "1990-05-20",
    "last_login": "2022-01-01 10:30:00"
  },
  {
    "name": "Alice",
    "birthdate": "1995-02-15",
    "last_login": "2022-02-05 14:45:00"
  }
]

To convert the date and datetime strings into pandas’ datetime objects, we can use the read_json() function as follows:

import pandas as pd
df = pd.read_json('data.json', convert_dates=['birthdate', 'last_login'])
print(df.dtypes)

Output:

name                  object
birthdate     datetime64[ns]
last_login    datetime64[ns]
dtype: object

In the code above, the file contains a JSON array of objects (records).

We specify convert_dates=['birthdate', 'last_login'] to indicate the columns that should be converted into pandas’ datetime objects.
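Conversely, passing convert_dates=False leaves the date columns as plain strings (pandas also recognizes some date-like column names by default, which is controlled by the keep_default_dates parameter):

import pandas as pd

# Keep birthdate and last_login as plain strings
df = pd.read_json('data.json', convert_dates=False)
print(df.dtypes)   # all columns remain object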


Choose the proper Encoding

If your JSON files are encoded in something other than UTF-8, set the encoding parameter to the proper encoding to read the data correctly.

import pandas as pd
try:
    df = pd.read_json('arabic_data.json')
    print(df.head())
except Exception as e:
    print(f"An error occurred: {e}")

If the code above raises an error, you likely need to specify the correct encoding for your files.

You can use the chardet module to detect the file encoding like this:

import chardet
def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

file_path = 'arabic_data.json' 
encoding = detect_encoding(file_path)
print(f'The detected encoding of the file is: {encoding}')

Note that chardet may return None or an incorrect encoding if it is unable to detect the encoding reliably. In that case, you should ask the owner of the files which encoding was used.

After detecting the file encoding (here, Windows-1256, a common encoding for Arabic text), you can read the file with the correct encoding like this:

import pandas as pd
df = pd.read_json('arabic_data.json', encoding='Windows-1256')
print(df.head())


Reading large JSON files

When dealing with large JSON files stored in JSON Lines format (one JSON object per line), setting the lines parameter to True can be beneficial. Combined with the chunksize parameter, it lets you read the objects in batches, which is memory-efficient and helps process datasets that don’t fit into memory.
Let’s consider an example where we have a large JSON file with each line representing a separate JSON object:

{"name": "John", "age": 44, "city": "New York"}
{"name": "Jane", "age": 32, "city": "Chicago"}
{"name": "Tom", "age": 35, "city": "San Francisco"}
{"name": "Alice", "age": 25, "city": "Boston"}
{"name": "Bob", "age": 27, "city": "Los Angeles"}
{"name": "Charlie", "age": 34, "city": "Seattle"}
{"name": "Dave", "age": 31, "city": "Austin"}
{"name": "Eve", "age": 33, "city": "Denver"}
{"name": "Frank", "age": 29, "city": "Dallas"}
{"name": "Grace", "age": 28, "city": "Miami"}
...

To read this large JSON file line by line, we can use the read_json() function with lines=True and set chunksize=1 to read one record at a time.

In real-world applications, the chunksize would typically be much larger, for example 500 or more.

import pandas as pd
start = 2
end = 5
total = 0
json_reader = pd.read_json('sample.json', lines=True, chunksize=1)
result = pd.DataFrame()

for chunk in json_reader:
    if total >= start and total < end:
        result = pd.concat([result, chunk])
    total += 1
    if total == end:
        break
print(result)

Output:

    name  age           city
2    Tom   35  San Francisco
3  Alice   25         Boston
4    Bob   27    Los Angeles

In the above code, we keep the records with indices 2 through 4 (from start up to, but not including, end); on each iteration, read_json yields one record, according to the chunksize we set.

Then we concatenate the result chunks into a DataFrame.

You can adjust this number based on the size of your data and your available memory.

Note: This will work only if each line in your JSON file represents a valid JSON object.
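Rather than concatenating chunks into one DataFrame, which reintroduces the memory problem for very large files, you can also aggregate each chunk as you go. Here is a minimal sketch that computes an average age without ever holding the full dataset in memory:

import pandas as pd

total_age = 0
total_rows = 0

# Process the file 500 records at a time
for chunk in pd.read_json('sample.json', lines=True, chunksize=500):
    total_age += chunk['age'].sum()
    total_rows += len(chunk)

print('Average age:', total_age / total_rows)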


Limit rows using nrows

When dealing with large JSON files, you may not always need to load the entire dataset into memory. The nrows parameter allows you to specify the number of rows to read from the JSON file.

Let’s say we have a large JSON file with thousands of rows, but we only want to read the first 3 rows. We can achieve this by setting nrows=3:

import pandas as pd
df = pd.read_json('data.json', lines=True, nrows=3)
print(df)

Output:

   name  age           city
0  John   44       New York
1  Jane   32        Chicago
2   Tom   35  San Francisco

In the code snippet above, we read the JSON file “data.json” and specify nrows=3 to read only the first 3 rows. The resulting DataFrame, df, contains the specified number of rows from the JSON file.

Note: To use the nrows parameter, you need to set the lines parameter to True.


At The Speed of Light

I was hired by a large financial services company that had been suffering from an inefficiency problem related to their data processing application.

The application was designed to analyze a constant influx of transaction data from several different banking partners, which came in the form of large JSON files.

The files contained data such as transaction times, amounts, categories, and merchant details. The data was crucial for the company’s risk analysis and fraud detection modules.

The issue was that the application, which used an older version of Pandas, was having difficulty handling the sheer volume of data, especially at peak times.

It was using the read_json function to load all of the data into a DataFrame in one go.

At peak times, the data could reach 10 GB in size. This was causing a substantial delay, leading to slower risk assessment and, consequently, potential revenue loss.

After analyzing the situation, I realized that the application was running on Pandas version 0.20.3, which was not the most efficient for their use case.

The first step I took was to upgrade Pandas to the latest version available at the time, which was 1.3.0. But the significant performance boost came when I started implementing chunking with read_json. The real pain was upgrading the other modules to make everything work with the newer version of Pandas.

By modifying the code to load the data in chunks, we were able to drastically reduce the load time.

This change led to a dramatic improvement: an 83% decrease in loading time!
