Remove Duplicates from a JSON Array in Python
In this tutorial, you’ll learn how to remove duplicates from a JSON array in Python.
You’ll learn various techniques, from using sets to deal with exact match duplicates to advanced strategies for handling deeply nested JSON structures.
Types of Duplicates in JSON Arrays
There are three types of duplicates: exact match, key-based, and value-based, each illustrated with an example below.
Exact Match Duplicates:
[ {"id": 101, "name": "John", "email": "john@example.com"}, {"id": 102, "name": "Jane", "email": "jane@example.com"}, {"id": 101, "name": "John", "email": "john@example.com"} // Exact duplicate of the first entry ]
In this case, the entire JSON object is repeated in the array, making it an exact match duplicate.
Key-Based Duplicates:
[ {"id": 201, "name": "Alice", "email": "alice@example.com"}, {"id": 202, "name": "Bob", "email": "bob@example.com"}, {"id": 201, "name": "Alison", "email": "alison@example.com"} // Key-based duplicate based on 'id' ]
Here, duplicates are identified based on a specific key (‘id’), regardless of other differing key-value pairs.
Value-Based Duplicates:
[ {"id": 301, "name": "Dave", "email": "dave@example.com"}, {"id": 302, "name": "Eve", "email": "dave@example.com"}, // Value-based duplicate based on 'email' {"id": 303, "name": "Dave", "email": "dave@sample.com"} ]
In this case, duplicates are defined by identical values for a specific key ('email'), irrespective of the values of other keys.
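The examples in this tutorial work with Python lists of dictionaries. If your data arrives as a raw JSON string instead, parse it first with json.loads before applying any of the techniques below:

import json

raw = '[{"id": 101, "name": "John"}, {"id": 101, "name": "John"}]'
json_data = json.loads(raw)  # now a regular list of dicts you can de-duplicate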
Using Python Sets for Exact Match Removal
When dealing with exact match duplicates in JSON arrays, one efficient approach in Python is using sets.
Sets are collections that do not allow duplicate elements, making them perfect for this task.
Here’s how you can use a set to remove exact match duplicates from a JSON array:
First, convert your JSON objects into strings, since dictionaries are unhashable and cannot be added to a set directly.
Then, add these strings to a set. This process automatically removes any duplicates. Finally, convert the strings back into JSON objects.
import json

json_data = [
    {"id": 101, "name": "John", "email": "john@example.com"},
    {"id": 102, "name": "Jane", "email": "jane@example.com"},
    {"id": 101, "name": "John", "email": "john@example.com"}
]

unique_data = list({json.dumps(item, sort_keys=True) for item in json_data})
unique_json_data = [json.loads(item) for item in unique_data]
print(unique_json_data)
Output:
[ {"email": "jane@example.com", "id": 102, "name": "Jane"}, {"email": "john@example.com", "id": 101, "name": "John"} ]
In this code, json.dumps converts the JSON objects into strings; sort_keys=True ensures that two objects with the same content but different key order serialize identically. The strings are added to a set, which removes the duplicates, and json.loads converts them back into dictionaries. Note that sets are unordered, so the original order of the array is not preserved.
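If you need to keep the original order of the array, a small variation tracks the serialized strings in a set while building the result list in order. Here is a minimal sketch of that approach (the function name dedupe_preserve_order is just illustrative):

import json

def dedupe_preserve_order(json_data):
    seen = set()
    unique = []
    for item in json_data:
        # Serialize with sorted keys so key order doesn't affect comparison
        serialized = json.dumps(item, sort_keys=True)
        if serialized not in seen:
            seen.add(serialized)
            unique.append(item)  # first occurrence wins, order preserved
    return unique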
Remove Key-Based Duplicates
There are two ways to remove key-based duplicates: a custom function, or the Pandas DataFrame drop_duplicates function.
Using a Custom Function
You can create a custom function that iterates through the JSON data, tracking seen values for the specified key, and retains only the first occurrence of each value.
def remove_key_based_duplicates(json_data, key):
    seen = set()
    new_data = []
    for item in json_data:
        val = item.get(key)
        if val not in seen:
            seen.add(val)
            new_data.append(item)
    return new_data

json_data = [
    {"id": 201, "name": "Alice", "email": "alice@example.com"},
    {"id": 202, "name": "Bob", "email": "bob@example.com"},
    {"id": 201, "name": "Alison", "email": "alison@example.com"}
]

unique_json_data = remove_key_based_duplicates(json_data, 'id')
print(unique_json_data)
Output:
[ {"id": 201, "name": "Alice", "email": "alice@example.com"}, {"id": 202, "name": "Bob", "email": "bob@example.com"} ]
Using DataFrame drop_duplicates
Pandas DataFrame offers a more concise way to remove key-based duplicates using the drop_duplicates function.
import pandas as pd

json_data = [
    {"id": 201, "name": "Alice", "email": "alice@example.com"},
    {"id": 202, "name": "Bob", "email": "bob@example.com"},
    {"id": 201, "name": "Alison", "email": "alison@example.com"}
]

df = pd.DataFrame(json_data)
df_unique = df.drop_duplicates('id')
unique_json_data = df_unique.to_dict(orient='records')
print(unique_json_data)
Output:
[ {"id": 201, "name": "Alice", "email": "alice@example.com"}, {"id": 202, "name": "Bob", "email": "bob@example.com"} ]
This method is efficient and requires less code, especially for large datasets.
It handles the duplicate removal internally and provides an easy way to specify which column(s) to consider for removing duplicates.
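By default, drop_duplicates keeps the first occurrence of each duplicated value. If you want the last occurrence instead, pass keep='last':

import pandas as pd

json_data = [
    {"id": 201, "name": "Alice", "email": "alice@example.com"},
    {"id": 201, "name": "Alison", "email": "alison@example.com"}
]

df = pd.DataFrame(json_data)
# keep='last' retains the final occurrence of each duplicated 'id'
print(df.drop_duplicates('id', keep='last').to_dict(orient='records'))
# [{'id': 201, 'name': 'Alison', 'email': 'alison@example.com'}]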
Customized De-Duping (Allow Only X Occurrences)
Sometimes, you might want to allow a limited number of duplicate entries in your JSON array.
This is known as customized de-duping, where you permit only ‘X’ occurrences of a specific key-value pair.
To do this in Python, we’ll use a custom function that tracks how many times each value of the specified key has appeared and keeps an item only while its count is within the limit.
def remove_duplicates_with_limit(json_data, key, max_occurrences):
    occurrence_count = {}
    new_data = []
    for item in json_data:
        val = item.get(key)
        if val not in occurrence_count:
            occurrence_count[val] = 0
        occurrence_count[val] += 1
        if occurrence_count[val] <= max_occurrences:
            new_data.append(item)
    return new_data

json_data = [
    {"id": 101, "name": "John", "email": "john@example.com"},
    {"id": 101, "name": "John", "email": "john@example.com"},
    {"id": 102, "name": "Jane", "email": "jane@example.com"},
    {"id": 101, "name": "John", "email": "john@example.com"}
]

# Remove duplicates allowing only 2 occurrences
unique_json_data = remove_duplicates_with_limit(json_data, 'id', 2)
print(unique_json_data)
Output:
[ {"id": 101, "name": "John", "email": "john@example.com"}, {"id": 101, "name": "John", "email": "john@example.com"}, {"id": 102, "name": "Jane", "email": "jane@example.com"} ]
In this code, the function remove_duplicates_with_limit keeps a count of occurrences for each value of the specified key. Once the count exceeds the specified max_occurrences, further duplicates are discarded.
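The same bookkeeping can be written more compactly with collections.Counter, which defaults missing counts to zero. This sketch is behaviorally equivalent to the function above:

from collections import Counter

def remove_duplicates_with_limit(json_data, key, max_occurrences):
    counts = Counter()
    new_data = []
    for item in json_data:
        val = item.get(key)
        counts[val] += 1  # Counter returns 0 for values not seen yet
        if counts[val] <= max_occurrences:
            new_data.append(item)
    return new_data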
Remove Value-Based Duplicates
You can remove value-based duplicates using a custom function or the Pandas DataFrame drop_duplicates function.
Using a Custom Function
You can create a custom function that builds a tuple of the values for the specified keys, checks whether that tuple has been seen before, and keeps only the first occurrence:
def remove_value_based_duplicates(json_data, key_list):
    seen_values = set()
    new_data = []
    for item in json_data:
        # Create a tuple of values based on the key_list
        values = tuple(item.get(key) for key in key_list)
        if values not in seen_values:
            seen_values.add(values)
            new_data.append(item)
    return new_data

json_data = [
    {"id": 301, "name": "Dave", "email": "dave@example.com"},
    {"id": 302, "name": "Eve", "email": "dave@example.com"},
    {"id": 303, "name": "Dave", "email": "dave@sample.com"}
]

unique_json_data = remove_value_based_duplicates(json_data, ['email'])
print(unique_json_data)
Output:
[ {"id": 301, "name": "Dave", "email": "dave@example.com"}, {"id": 303, "name": "Dave", "email": "dave@sample.com"} ]
Here we checked for value-based duplicates on the email value. Because the function accepts a list of keys, you can also de-duplicate on a combination of values, as shown below.
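For example, to treat items as duplicates only when both the name and the email match, pass both keys to the remove_value_based_duplicates function defined above:

json_data = [
    {"id": 301, "name": "Dave", "email": "dave@example.com"},
    {"id": 302, "name": "Dave", "email": "dave@example.com"},  # same name and email: dropped
    {"id": 303, "name": "Eve", "email": "dave@example.com"}    # same email, different name: kept
]

unique_json_data = remove_value_based_duplicates(json_data, ['name', 'email'])
print(unique_json_data)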
Using DataFrame drop_duplicates
You can pass the field name whose duplicates you want to drop to the subset parameter of drop_duplicates:
import pandas as pd

json_data = [
    {"id": 301, "name": "Dave", "email": "dave@example.com"},
    {"id": 302, "name": "Eve", "email": "dave@example.com"},
    {"id": 303, "name": "Dave", "email": "dave@sample.com"}
]

df = pd.DataFrame(json_data)
df_unique = df.drop_duplicates(subset=['email'])
unique_json_data = df_unique.to_dict(orient='records')
print(unique_json_data)
Output:
[ {"id": 301, "name": "Dave", "email": "dave@example.com"}, {"id": 303, "name": "Dave", "email": "dave@sample.com"} ]
Remove Duplicates from Nested JSON Arrays
To drop duplicates in deeply nested JSON arrays, you need to use a recursive function.
Recursion enables you to traverse and compare nested structures.
A recursive function calls itself to solve progressively smaller pieces of a larger problem.
For nested JSON arrays, the function descends through each level of the structure until it reaches the most deeply nested objects, removing duplicates at every level.
import json

def remove_deep_duplicates(json_data):
    if isinstance(json_data, dict):
        # For dictionaries, apply the function to each value
        return {k: remove_deep_duplicates(v) for k, v in json_data.items()}
    elif isinstance(json_data, list):
        # For lists, de-duplicate nested structures first, then the list itself
        unique_data = []
        seen = set()
        for item in json_data:
            cleaned = remove_deep_duplicates(item)
            # Hash the cleaned item so objects that become equal after
            # nested de-duplication are also treated as duplicates
            item_hash = hash(json.dumps(cleaned, sort_keys=True))
            if item_hash not in seen:
                seen.add(item_hash)
                unique_data.append(cleaned)
        return unique_data
    else:
        # Return the item if it's not a list or dict
        return json_data

# Sample nested JSON array
json_data = [
    {
        "id": 101,
        "details": {
            "name": "John",
            "contacts": [{"email": "john@example.com"}, {"email": "john@example.com"}]
        }
    },
    {
        "id": 101,
        "details": {
            "name": "John",
            "contacts": [{"email": "john@example.com"}]
        }
    }
]

unique_json_data = remove_deep_duplicates(json_data)
print(unique_json_data)
Output:
[ { "id": 101, "details": { "name": "John", "contacts": [{"email": "john@example.com"}] } } ]
This function works by checking the type of each element; if it’s a list or dictionary, the function is called recursively. For lists, nested duplicates are removed first and each item is then hashed in its cleaned form, so objects that become identical after nested de-duplication are also collapsed into one, ensuring unique items at every nesting level.
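Hash collisions between different objects are extremely unlikely but theoretically possible. If you want to rule them out entirely, one variation stores the serialized strings themselves in the seen set instead of their hashes, trading a little memory for exact comparison (the function name here is just illustrative):

import json

def remove_deep_duplicates_exact(json_data):
    # Same traversal as above, but 'seen' holds the serialized strings
    # themselves, so distinct objects can never collide
    if isinstance(json_data, dict):
        return {k: remove_deep_duplicates_exact(v) for k, v in json_data.items()}
    elif isinstance(json_data, list):
        unique_data = []
        seen = set()
        for item in json_data:
            cleaned = remove_deep_duplicates_exact(item)
            serialized = json.dumps(cleaned, sort_keys=True)
            if serialized not in seen:
                seen.add(serialized)
                unique_data.append(cleaned)
        return unique_data
    return json_data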