Merge Multiple JSON files in Python using Pandas
In this tutorial, you’ll learn how to merge JSON files using Python Pandas library, and how to handle any challenges during the process.
You’ll learn about different merging techniques like inner, outer, left, and right joins, tackle the challenges of duplicate values, and more.
Using concat()
To use concat()
, you first need to read each of your JSON files into separate DataFrames using read_json()
function.
Once you have your DataFrames, you can merge them using concat()
. Here’s a step-by-step breakdown:
- Read Multiple JSON Files: Loop through your JSON files and read each one into a DataFrame.
- Concatenate DataFrames: Use
pd.concat()
to merge these DataFrames into one.
Here’s a sample code snippet to illustrate this process:
import pandas as pd import glob file_list = glob.glob('path/to/json_files/*.json') data_frames = [pd.read_json(file) for file in file_list] combined_df = pd.concat(data_frames) combined_df.to_json('path/to/output/combined_data.json') print(combined_df.head())
Output:
column1 column2 column3 0 12 34 56 1 78 90 12 ...
The combined_df.to_json()
saves the concatenated JSON to a new JSON file at the specified location.
Using join()
Unlike concat()
, which stacks DataFrames vertically or horizontally, join()
is used for merging DataFrames based on their indexes or on a key column.
This is useful when your JSON files contain related data that you want to join in a structured way, much like a SQL join.
The process involves two main steps:
- Read JSON Files and Set Index: Read each JSON file into a DataFrame and set the appropriate column as the index if you plan to join on indexes.
- Join DataFrames: Use the
join()
method to merge these DataFrames. You can specify the type of join (e.g., ‘left’, ‘right’, ‘inner’, ‘outer’) depending on your requirements.
Here’s an example code snippet:
import pandas as pd df1 = pd.read_json('path/to/first.json').set_index('key_column') df2 = pd.read_json('path/to/second.json').set_index('key_column') joined_df = df1.join(df2, how='inner') # Adjust the join type as needed print(joined_df.head())
Output:
column1_df1 column2_df1 column1_df2 column2_df2 key_column key1 45 67 89 10 key2 11 22 33 44 ...
In this example, column1_df1
, column2_df1
, etc., represent column names from your DataFrames.
The join is performed on a common ‘key_column’, which is set as the index for both DataFrames.
Using merge()
Pandas provides the merge()
function for combining DataFrames based on specific keys, similar to SQL join operations.
It allows various types of joins (like inner, outer, left, right) and enables you to specify one or more keys to merge on.
Here’s how you can use merge()
:
import pandas as pd df1 = pd.read_json('path/to/first.json') df2 = pd.read_json('path/to/second.json') merged_df = pd.merge(df1, df2, on='key_column', how='inner') print(merged_df.head())
Output:
key_column column1_df1 column2_df1 column1_df2 column2_df2 0 key1 23 45 67 89 1 key2 10 20 30 40 ...
In this snippet, key_column
is the column used to merge the DataFrames, and column1_df1
, column2_df1
, etc., represent the other columns in your DataFrames.
The how='inner'
parameter performs an inner join, meaning only rows with matching keys in both DataFrames are included in the final result.
Understanding Types of Merge Joins
Each type of merge joins serves a specific purpose and can get different results depending on your data.
Here’s an overview of each type:
Inner Join
- What It Does: Returns only the rows that have matching values in both DataFrames.
- Use Case: Ideal when you want to retain only the rows that have corresponding data in both datasets.
Outer Join
- What It Does: Returns all rows from both DataFrames, filling in NaNs where there are no matches.
- Use Case: Useful when you want to retain all data from both datasets, regardless of whether there are matching entries.
Left Join
- What It Does: Returns all rows from the left DataFrame and the matched rows from the right DataFrame. Non-matched rows from the right DataFrame are replaced with NaN.
- Use Case: Suited for situations where you want all data from the left DataFrame, supplemented by corresponding data from the right DataFrame.
Right Join
- What It Does: Returns all rows from the right DataFrame and the matched rows from the left DataFrame. Non-matched rows from the left DataFrame are replaced with NaN.
- Use Case: Used when you want all data from the right DataFrame, enhanced by corresponding data from the left DataFrame.
Here’s a visual representation using sample code:
import pandas as pd df_left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3']}) df_right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'], 'B': ['B0', 'B1', 'B4', 'B5']}) inner_join_df = pd.merge(df_left, df_right, on='key', how='inner') outer_join_df = pd.merge(df_left, df_right, on='key', how='outer') left_join_df = pd.merge(df_left, df_right, on='key', how='left') right_join_df = pd.merge(df_left, df_right, on='key', how='right') print("Inner Join:\n", inner_join_df) print("Outer Join:\n", outer_join_df) print("Left Join:\n", left_join_df) print("Right Join:\n", right_join_df)
Output:
Inner Join: key A B 0 K0 A0 B0 1 K1 A1 B1 Outer Join: key A B 0 K0 A0 B0 1 K1 A1 B1 2 K2 A2 NaN 3 K3 A3 NaN 4 K4 NaN B4 5 K5 NaN B5 Left Join: key A B 0 K0 A0 B0 1 K1 A1 B1 2 K2 A2 NaN 3 K3 A3 NaN Right Join: key A B 0 K0 A0 B0 1 K1 A1 B1 2 K4 NaN B4 3 K5 NaN B5
Handling Duplicate Values During the Merge
Duplicate values can arise in various scenarios, such as when datasets have overlapping records or when combining data from multiple sources.
Here are some strategies to handle duplicates during a merge:
Identifying Duplicates
Use DataFrame.duplicated()
to identify duplicate rows based on specific columns or the entire DataFrame.
Removing Duplicates
Use DataFrame.drop_duplicates()
to remove duplicate rows. You can specify columns to consider for identifying duplicates.
Deciding Which Duplicates to Keep
With drop_duplicates()
, choose whether to keep the first occurrence (keep='first'
), the last occurrence (keep='last'
), or to drop all duplicates (keep=False
).
Here’s a sample code to demonstrate handling duplicates:
import pandas as pd df1 = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3'], 'A': ['A0', 'A1', 'A1', 'A3']}) df2 = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'], 'B': ['B0', 'B1', 'B4', 'B5']}) df1 = df1.drop_duplicates(subset='key', keep='first') merged_df = pd.merge(df1, df2, on='key', how='inner') merged_df = merged_df.drop_duplicates(keep='first') print(merged_df)
Output:
key A B 0 K0 A0 B0 1 K1 A1 B1
In this example, duplicates in df1
based on the ‘key’ column are removed before the merge.
After merging, we again check for duplicates and remove them if found.
Flattening Nested JSON for Merging
Flattening these nested JSON objects is essential for merging them into a Pandas DataFrame.
Flattening nested data into a tabular format makes it easier to merge.
You can use json_normalize()
function to flatten the nested JSON.
This function converts a list of nested dictionaries into a DataFrame. Specify the record path and meta parameters to control the level of flattening.
Here’s an example demonstrating flattening and merging:
import pandas as pd from pandas.io.json import json_normalize nested_json1 = [ {"key": "K1", "data": {"A": 1, "B": 2}}, {"key": "K2", "data": {"A": 3, "B": 4}} ] nested_json2 = [ {"key": "K1", "info": {"C": 5, "D": 6}}, {"key": "K2", "info": {"C": 7, "D": 8}} ] df1 = json_normalize(nested_json1, sep='_') df2 = json_normalize(nested_json2, sep='_') merged_df = pd.merge(df1, df2, on='key') print(merged_df)
Output:
key data_A data_B info_C info_D 0 K1 1 2 5 6 1 K2 3 4 7 8
The sep='_'
argument creates column names that reflect the nested hierarchy.
After flattening, the data is merged on the ‘key’ column.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.