Remove Duplicates from a JSON Array in Python
In this tutorial, you’ll learn how to remove duplicates from a JSON array in Python.
You’ll learn various techniques, from using sets to deal with exact match duplicates to advanced strategies for handling deeply nested JSON structures.
Types of Duplicates in JSON Arrays
There are three types of duplicates: exact match, key-based, and value-based, each illustrated with an example below.
Exact Match Duplicates:
[ {"id": 101, "name": "John", "email": "john@example.com"}, {"id": 102, "name": "Jane", "email": "jane@example.com"}, {"id": 101, "name": "John", "email": "john@example.com"} // Exact duplicate of the first entry ]
In this case, the entire JSON object is repeated in the array, making it an exact match duplicate.
Key-Based Duplicates:
[ {"id": 201, "name": "Alice", "email": "alice@example.com"}, {"id": 202, "name": "Bob", "email": "bob@example.com"}, {"id": 201, "name": "Alison", "email": "alison@example.com"} // Key-based duplicate based on 'id' ]
Here, duplicates are identified based on a specific key (‘id’), regardless of other differing key-value pairs.
Value-Based Duplicates:
[ {"id": 301, "name": "Dave", "email": "dave@example.com"}, {"id": 302, "name": "Eve", "email": "dave@example.com"}, // Value-based duplicate based on 'email' {"id": 303, "name": "Dave", "email": "dave@sample.com"} ]
In this case, duplicates are defined by identical values for a specific key ('email'), irrespective of the values of other keys.
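The examples in this tutorial work with Python lists of dictionaries. If your data arrives as a raw JSON string instead, parse it first with json.loads before applying any of the techniques below:

import json

raw = '[{"id": 101, "name": "John"}, {"id": 101, "name": "John"}]'
json_data = json.loads(raw)  # now a regular list of dicts you can de-duplicate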
Using Python Sets for Exact Match Removal
When dealing with exact match duplicates in JSON arrays, one efficient approach in Python is using sets.
Sets are collections that do not allow duplicate elements, making them perfect for this task.
Here’s how you can use a set to remove exact match duplicates from a JSON array:
First, convert your JSON objects into strings, since dictionaries are unhashable and cannot be added to a set directly.
Then, add these strings to a set. This process automatically removes any duplicates. Finally, convert the strings back into JSON objects.
import json

json_data = [
    {"id": 101, "name": "John", "email": "john@example.com"},
    {"id": 102, "name": "Jane", "email": "jane@example.com"},
    {"id": 101, "name": "John", "email": "john@example.com"}
]

unique_data = list({json.dumps(item, sort_keys=True) for item in json_data})
unique_json_data = [json.loads(item) for item in unique_data]
print(unique_json_data)
Output:
[ {"email": "jane@example.com", "id": 102, "name": "Jane"}, {"email": "john@example.com", "id": 101, "name": "John"} ]
In this code, json.dumps converts the JSON objects into strings; sort_keys=True ensures that two objects with the same content but different key order serialize identically. The strings are added to a set, which removes the duplicates, and json.loads converts them back into dictionaries. Note that sets are unordered, so the original order of the array is not preserved.
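If you need to keep the original order of the array, a small variation tracks the serialized strings in a set while building the result list in order. Here is a minimal sketch of that approach (the function name dedupe_preserve_order is just illustrative):

import json

def dedupe_preserve_order(json_data):
    seen = set()
    unique = []
    for item in json_data:
        # Serialize with sorted keys so key order doesn't affect comparison
        serialized = json.dumps(item, sort_keys=True)
        if serialized not in seen:
            seen.add(serialized)
            unique.append(item)  # first occurrence wins, order preserved
    return unique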
Remove Key-Based Duplicates
There are two ways to remove key-based duplicates: a custom function, or the Pandas DataFrame drop_duplicates function.
Using a Custom Function
You can create a custom function that iterates through the JSON data, tracking seen values for the specified key, and retains only the first occurrence of each value.
def remove_key_based_duplicates(json_data, key):
    seen = set()
    new_data = []
    for item in json_data:
        val = item.get(key)
        if val not in seen:
            seen.add(val)
            new_data.append(item)
    return new_data

json_data = [
    {"id": 201, "name": "Alice", "email": "alice@example.com"},
    {"id": 202, "name": "Bob", "email": "bob@example.com"},
    {"id": 201, "name": "Alison", "email": "alison@example.com"}
]

unique_json_data = remove_key_based_duplicates(json_data, 'id')
print(unique_json_data)
Output:
[ {"id": 201, "name": "Alice", "email": "alice@example.com"}, {"id": 202, "name": "Bob", "email": "bob@example.com"} ]
Using DataFrame drop_duplicates
Pandas DataFrame offers a more concise way to remove key-based duplicates using the drop_duplicates function.
import pandas as pd

json_data = [
    {"id": 201, "name": "Alice", "email": "alice@example.com"},
    {"id": 202, "name": "Bob", "email": "bob@example.com"},
    {"id": 201, "name": "Alison", "email": "alison@example.com"}
]

df = pd.DataFrame(json_data)
df_unique = df.drop_duplicates('id')
unique_json_data = df_unique.to_dict(orient='records')
print(unique_json_data)
Output:
[ {"id": 201, "name": "Alice", "email": "alice@example.com"}, {"id": 202, "name": "Bob", "email": "bob@example.com"} ]
This method is efficient and requires less code, especially for large datasets.
It handles the duplicate removal internally and provides an easy way to specify which column(s) to consider for removing duplicates.
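By default, drop_duplicates keeps the first occurrence of each duplicated value. If you want the last occurrence instead, pass keep='last':

import pandas as pd

json_data = [
    {"id": 201, "name": "Alice", "email": "alice@example.com"},
    {"id": 201, "name": "Alison", "email": "alison@example.com"}
]

df = pd.DataFrame(json_data)
# keep='last' retains the final occurrence of each duplicated 'id'
print(df.drop_duplicates('id', keep='last').to_dict(orient='records'))
# [{'id': 201, 'name': 'Alison', 'email': 'alison@example.com'}]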
Customized De-Duping (Allow Only X Occurrences)
Sometimes, you might want to allow a limited number of duplicate entries in your JSON array.
This is known as customized de-duping, where you permit only ‘X’ occurrences of a specific key-value pair.
To do this in Python, we’ll use a custom function that tracks how many times each value of the specified key has appeared and keeps an item only while its count is within the limit.
def remove_duplicates_with_limit(json_data, key, max_occurrences):
    occurrence_count = {}
    new_data = []
    for item in json_data:
        val = item.get(key)
        if val not in occurrence_count:
            occurrence_count[val] = 0
        occurrence_count[val] += 1
        if occurrence_count[val] <= max_occurrences:
            new_data.append(item)
    return new_data

json_data = [
    {"id": 101, "name": "John", "email": "john@example.com"},
    {"id": 101, "name": "John", "email": "john@example.com"},
    {"id": 102, "name": "Jane", "email": "jane@example.com"},
    {"id": 101, "name": "John", "email": "john@example.com"}
]

# Remove duplicates allowing only 2 occurrences
unique_json_data = remove_duplicates_with_limit(json_data, 'id', 2)
print(unique_json_data)
Output:
[ {"id": 101, "name": "John", "email": "john@example.com"}, {"id": 101, "name": "John", "email": "john@example.com"}, {"id": 102, "name": "Jane", "email": "jane@example.com"} ]
In this code, the function remove_duplicates_with_limit keeps a count of occurrences for each value of the specified key. Once the count exceeds the specified max_occurrences, further duplicates are discarded.
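The same bookkeeping can be written more compactly with collections.Counter, which defaults missing counts to zero. This sketch is behaviorally equivalent to the function above:

from collections import Counter

def remove_duplicates_with_limit(json_data, key, max_occurrences):
    counts = Counter()
    new_data = []
    for item in json_data:
        val = item.get(key)
        counts[val] += 1  # Counter returns 0 for values not seen yet
        if counts[val] <= max_occurrences:
            new_data.append(item)
    return new_data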
Remove Value-Based Duplicates
You can remove value-based duplicates using a custom function or the Pandas DataFrame drop_duplicates function.
Using a Custom Function
You can create a custom function that builds a tuple of the values for the specified keys, checks whether that tuple has been seen before, and keeps only the first occurrence:
def remove_value_based_duplicates(json_data, key_list):
    seen_values = set()
    new_data = []
    for item in json_data:
        # Create a tuple of values based on the key_list
        values = tuple(item.get(key) for key in key_list)
        if values not in seen_values:
            seen_values.add(values)
            new_data.append(item)
    return new_data

json_data = [
    {"id": 301, "name": "Dave", "email": "dave@example.com"},
    {"id": 302, "name": "Eve", "email": "dave@example.com"},
    {"id": 303, "name": "Dave", "email": "dave@sample.com"}
]

unique_json_data = remove_value_based_duplicates(json_data, ['email'])
print(unique_json_data)
Output:
[ {"id": 301, "name": "Dave", "email": "dave@example.com"}, {"id": 303, "name": "Dave", "email": "dave@sample.com"} ]
Here we checked for value-based duplicates on the email value. Because the function accepts a list of keys, you can also de-duplicate on a combination of values, as shown below.
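For example, to treat items as duplicates only when both the name and the email match, pass both keys to the remove_value_based_duplicates function defined above:

json_data = [
    {"id": 301, "name": "Dave", "email": "dave@example.com"},
    {"id": 302, "name": "Dave", "email": "dave@example.com"},  # same name and email: dropped
    {"id": 303, "name": "Eve", "email": "dave@example.com"}    # same email, different name: kept
]

unique_json_data = remove_value_based_duplicates(json_data, ['name', 'email'])
print(unique_json_data)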
Using DataFrame drop_duplicates
You can pass the field name whose duplicates you want to drop to the subset parameter of drop_duplicates:
import pandas as pd

json_data = [
    {"id": 301, "name": "Dave", "email": "dave@example.com"},
    {"id": 302, "name": "Eve", "email": "dave@example.com"},
    {"id": 303, "name": "Dave", "email": "dave@sample.com"}
]

df = pd.DataFrame(json_data)
df_unique = df.drop_duplicates(subset=['email'])
unique_json_data = df_unique.to_dict(orient='records')
print(unique_json_data)
Output:
[ {"id": 301, "name": "Dave", "email": "dave@example.com"}, {"id": 303, "name": "Dave", "email": "dave@sample.com"} ]
Remove Duplicates from Nested JSON Arrays
To drop duplicates in deeply nested JSON arrays, you need to use a recursive function.
Recursion enables you to traverse and compare nested structures.
A recursive function calls itself to solve progressively smaller pieces of a larger problem.
For nested JSON arrays, the function descends through each level of the structure until it reaches the most deeply nested objects, removing duplicates at every level.
import json

def remove_deep_duplicates(json_data):
    if isinstance(json_data, dict):
        # For dictionaries, apply the function to each value
        return {k: remove_deep_duplicates(v) for k, v in json_data.items()}
    elif isinstance(json_data, list):
        # For lists, de-duplicate nested structures first, then the list itself
        unique_data = []
        seen = set()
        for item in json_data:
            cleaned = remove_deep_duplicates(item)
            # Hash the cleaned item so objects that become equal after
            # nested de-duplication are also treated as duplicates
            item_hash = hash(json.dumps(cleaned, sort_keys=True))
            if item_hash not in seen:
                seen.add(item_hash)
                unique_data.append(cleaned)
        return unique_data
    else:
        # Return the item if it's not a list or dict
        return json_data

# Sample nested JSON array
json_data = [
    {
        "id": 101,
        "details": {
            "name": "John",
            "contacts": [{"email": "john@example.com"}, {"email": "john@example.com"}]
        }
    },
    {
        "id": 101,
        "details": {
            "name": "John",
            "contacts": [{"email": "john@example.com"}]
        }
    }
]

unique_json_data = remove_deep_duplicates(json_data)
print(unique_json_data)
Output:
[ { "id": 101, "details": { "name": "John", "contacts": [{"email": "john@example.com"}] } } ]
This function works by checking the type of each element; if it’s a list or dictionary, the function is called recursively. For lists, nested duplicates are removed first and each item is then hashed in its cleaned form, so objects that become identical after nested de-duplication are also collapsed into one, ensuring unique items at every nesting level.
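Hash collisions between different objects are extremely unlikely but theoretically possible. If you want to rule them out entirely, one variation stores the serialized strings themselves in the seen set instead of their hashes, trading a little memory for exact comparison (the function name here is just illustrative):

import json

def remove_deep_duplicates_exact(json_data):
    # Same traversal as above, but 'seen' holds the serialized strings
    # themselves, so distinct objects can never collide
    if isinstance(json_data, dict):
        return {k: remove_deep_duplicates_exact(v) for k, v in json_data.items()}
    elif isinstance(json_data, list):
        unique_data = []
        seen = set()
        for item in json_data:
            cleaned = remove_deep_duplicates_exact(item)
            serialized = json.dumps(cleaned, sort_keys=True)
            if serialized not in seen:
                seen.add(serialized)
                unique_data.append(cleaned)
        return unique_data
    return json_data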