Merge Multiple JSON Files in Python Using Pandas

In this tutorial, you’ll learn how to merge JSON files using the Python Pandas library and how to handle common challenges along the way.

You’ll learn about different merging techniques such as inner, outer, left, and right joins, how to handle duplicate values, and more.


Using concat()

To use concat(), you first need to read each of your JSON files into a separate DataFrame using the read_json() function.

Once you have your DataFrames, you can merge them using concat(). Here’s a step-by-step breakdown:

  1. Read Multiple JSON Files: Loop through your JSON files and read each one into a DataFrame.
  2. Concatenate DataFrames: Use pd.concat() to merge these DataFrames into one.

Here’s a sample code snippet to illustrate this process:

import pandas as pd
import glob

# Collect all JSON files in the folder and read each one into its own DataFrame
file_list = glob.glob('path/to/json_files/*.json')
data_frames = [pd.read_json(file) for file in file_list]

# Stack the DataFrames; ignore_index=True gives the result a clean 0..n index
combined_df = pd.concat(data_frames, ignore_index=True)

# Save the combined data to a new JSON file
combined_df.to_json('path/to/output/combined_data.json')
print(combined_df.head())

Output:

   column1  column2  column3
0       12       34       56
1       78       90       12
...

The combined_df.to_json() call saves the concatenated data to a new JSON file at the specified location.
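By default, to_json() writes a column-oriented JSON object. If you’d rather have one object per row, or newline-delimited JSON Lines, the orient and lines parameters control this; a quick sketch with assumed output paths:

# One object per row: [{"column1": 12, "column2": 34, ...}, ...]
combined_df.to_json('path/to/output/combined_records.json', orient='records')

# Newline-delimited JSON Lines, one record per line (handy for large files)
combined_df.to_json('path/to/output/combined_data.jsonl', orient='records', lines=True)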


Using join()

Unlike concat(), which stacks DataFrames vertically or horizontally, join() is used for merging DataFrames based on their indexes or on a key column.

This is useful when your JSON files contain related data that you want to join in a structured way, much like a SQL join.

The process involves two main steps:

  1. Read JSON Files and Set Index: Read each JSON file into a DataFrame and set the appropriate column as the index if you plan to join on indexes.
  2. Join DataFrames: Use the join() method to merge these DataFrames. You can specify the type of join (e.g., ‘left’, ‘right’, ‘inner’, ‘outer’) depending on your requirements.

Here’s an example code snippet:

import pandas as pd

# Read each JSON file and use the shared key column as the index
df1 = pd.read_json('path/to/first.json').set_index('key_column')
df2 = pd.read_json('path/to/second.json').set_index('key_column')

# lsuffix/rsuffix disambiguate columns that exist in both DataFrames
joined_df = df1.join(df2, how='inner', lsuffix='_df1', rsuffix='_df2')  # Adjust the join type as needed
print(joined_df.head())

Output:

           column1_df1  column2_df1  column1_df2  column2_df2
key_column                                                 
key1                45           67           89           10
key2                11           22           33           44
...

In this example, column1_df1, column2_df1, etc., are columns that appear in both files; the _df1 and _df2 suffixes come from the lsuffix and rsuffix arguments.

The join is performed on a common ‘key_column’, which is set as the index for both DataFrames.
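If you prefer not to set the index on the left DataFrame, join() also accepts an on= argument: the named column of the caller is matched against the index of the other DataFrame. Here’s a minimal sketch using the same hypothetical files:

df1 = pd.read_json('path/to/first.json')                            # 'key_column' stays a regular column
df2 = pd.read_json('path/to/second.json').set_index('key_column')   # only the right side is indexed
joined_df = df1.join(df2, on='key_column', how='inner', lsuffix='_df1', rsuffix='_df2')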


Using merge()

Pandas provides the merge() function for combining DataFrames based on specific keys, similar to SQL join operations.

It allows various types of joins (like inner, outer, left, right) and enables you to specify one or more keys to merge on.

Here’s how you can use merge():

import pandas as pd

df1 = pd.read_json('path/to/first.json')
df2 = pd.read_json('path/to/second.json')

# suffixes renames overlapping columns; the default would be '_x' and '_y'
merged_df = pd.merge(df1, df2, on='key_column', how='inner', suffixes=('_df1', '_df2'))
print(merged_df.head())

Output:

   key_column  column1_df1  column2_df1  column1_df2  column2_df2
0        key1           23           45           67           89
1        key2           10           20           30           40
...

In this snippet, key_column is the column used to merge the DataFrames, and column1_df1, column2_df1, etc., are the remaining columns, renamed with the suffixes argument because the same names appear in both files.

The how='inner' parameter performs an inner join, meaning only rows with matching keys in both DataFrames are included in the final result.
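Since merge() also accepts a list of keys, you can match rows on more than one column at a time. A minimal sketch with hypothetical column names:

# Rows match only when both hypothetical columns, 'store_id' and 'date', agree
merged_df = pd.merge(df1, df2, on=['store_id', 'date'], how='inner')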


Understanding Types of Merge Joins

Each type of merge join serves a specific purpose and can produce different results depending on your data.

Here’s an overview of each type:

Inner Join

  • What It Does: Returns only the rows that have matching values in both DataFrames.
  • Use Case: Ideal when you want to retain only the rows that have corresponding data in both datasets.

Outer Join

  • What It Does: Returns all rows from both DataFrames, filling in NaNs where there are no matches.
  • Use Case: Useful when you want to retain all data from both datasets, regardless of whether there are matching entries.

Left Join

  • What It Does: Returns all rows from the left DataFrame and the matched rows from the right DataFrame. Non-matched rows from the right DataFrame are replaced with NaN.
  • Use Case: Suited for situations where you want all data from the left DataFrame, supplemented by corresponding data from the right DataFrame.

Right Join

  • What It Does: Returns all rows from the right DataFrame and the matched rows from the left DataFrame. Non-matched rows from the left DataFrame are replaced with NaN.
  • Use Case: Used when you want all data from the right DataFrame, enhanced by corresponding data from the left DataFrame.

Here’s a side-by-side comparison using sample code:

import pandas as pd
df_left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                        'A': ['A0', 'A1', 'A2', 'A3']})
df_right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                         'B': ['B0', 'B1', 'B4', 'B5']})
inner_join_df = pd.merge(df_left, df_right, on='key', how='inner')
outer_join_df = pd.merge(df_left, df_right, on='key', how='outer')
left_join_df = pd.merge(df_left, df_right, on='key', how='left')
right_join_df = pd.merge(df_left, df_right, on='key', how='right')
print("Inner Join:\n", inner_join_df)
print("Outer Join:\n", outer_join_df)
print("Left Join:\n", left_join_df)
print("Right Join:\n", right_join_df)

Output:

Inner Join:
   key   A   B
0  K0  A0  B0
1  K1  A1  B1
Outer Join:
   key    A    B
0  K0   A0   B0
1  K1   A1   B1
2  K2   A2  NaN
3  K3   A3  NaN
4  K4  NaN   B4
5  K5  NaN   B5
Left Join:
   key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2  NaN
3  K3  A3  NaN
Right Join:
   key    A   B
0  K0   A0  B0
1  K1   A1  B1
2  K4  NaN  B4
3  K5  NaN  B5


Handling Duplicate Values During the Merge

Duplicate values can arise in various scenarios, such as when datasets have overlapping records or when combining data from multiple sources.

Here are some strategies to handle duplicates during a merge:

Identifying Duplicates

Use DataFrame.duplicated() to identify duplicate rows based on specific columns or the entire DataFrame.
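For example, here’s a quick sketch that flags repeated keys, assuming df1 is the DataFrame defined in the sample code further below (it contains ‘K1’ twice):

# Boolean Series: True for each row whose 'key' has already appeared earlier
mask = df1.duplicated(subset='key')
print(df1[mask])   # shows the second 'K1' row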

Removing Duplicates

Use DataFrame.drop_duplicates() to remove duplicate rows. You can specify columns to consider for identifying duplicates.

Deciding Which Duplicates to Keep

With drop_duplicates(), choose whether to keep the first occurrence (keep='first'), the last occurrence (keep='last'), or to drop all duplicates (keep=False).

Here’s a sample code to demonstrate handling duplicates:

import pandas as pd
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3'],
                    'A': ['A0', 'A1', 'A1', 'A3']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                    'B': ['B0', 'B1', 'B4', 'B5']})

df1 = df1.drop_duplicates(subset='key', keep='first')
merged_df = pd.merge(df1, df2, on='key', how='inner')
merged_df = merged_df.drop_duplicates(keep='first')
print(merged_df)

Output:

   key   A   B
0  K0  A0  B0
1  K1  A1  B1

In this example, duplicates in df1 based on the ‘key’ column are removed before the merge.

After merging, we again check for duplicates and remove them if found.
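If you instead want to discard every row involved in a duplication rather than keep one copy, pass keep=False. Applied to the original df1 above (before the drop_duplicates call), this would drop both ‘K1’ rows:

# keep=False removes every row whose 'key' is duplicated, keeping none of them
strict_df = df1.drop_duplicates(subset='key', keep=False)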


Flattening Nested JSON for Merging

JSON files often contain nested objects, and these must be flattened into a tabular format before they can be merged into a Pandas DataFrame.

You can use the pd.json_normalize() function to flatten the nested JSON.

This function converts a list of nested dictionaries into a DataFrame. The record_path and meta parameters control which nested records become rows and which outer fields are carried along.

Here’s an example demonstrating flattening and merging:

import pandas as pd

nested_json1 = [
    {"key": "K1", "data": {"A": 1, "B": 2}},
    {"key": "K2", "data": {"A": 3, "B": 4}}
]
nested_json2 = [
    {"key": "K1", "info": {"C": 5, "D": 6}},
    {"key": "K2", "info": {"C": 7, "D": 8}}
]

# pd.json_normalize flattens nested dictionaries; sep joins nested keys into column names
df1 = pd.json_normalize(nested_json1, sep='_')
df2 = pd.json_normalize(nested_json2, sep='_')

merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

Output:

  key  data_A  data_B  info_C  info_D
0  K1       1       2       5       6
1  K2       3       4       7       8

The sep='_' argument creates column names that reflect the nested hierarchy.

After flattening, the data is merged on the ‘key’ column.
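For deeper structures where each record holds a list of sub-records, the record_path and meta parameters decide which part becomes the rows and which outer fields are repeated alongside them. A minimal sketch with a hypothetical structure:

nested_json = [
    {"key": "K1", "items": [{"C": 5}, {"C": 6}]},
    {"key": "K2", "items": [{"C": 7}]}
]

# Each entry of 'items' becomes a row; 'key' is carried along as metadata on every row
flat_df = pd.json_normalize(nested_json, record_path='items', meta='key')
print(flat_df)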
