Get Unique Values from JSON Array in Python

In this tutorial, we’ll explore several methods to extract unique values (remove duplicates) from a JSON array in Python.

We’ll cover methods ranging from using Python sets and list comprehensions to more advanced methods such as collections.OrderedDict, custom functions, Pandas, and itertools.

Using Python Sets

A Python set is an unordered collection of items where every element is unique (i.e., no duplicates).

First, import the json module and load your data:

import json
json_data = '["iPhone 12", "Galaxy S20", "iPhone 12", "Galaxy S20", "Pixel 5"]'
phone_models = json.loads(json_data)

Now, convert this list into a set to automatically remove duplicates:

unique_models = set(phone_models)
print(unique_models)

Output:

{'Galaxy S20', 'iPhone 12', 'Pixel 5'}

In this output, you’ll notice that duplicates like ‘iPhone 12’ and ‘Galaxy S20’, which appeared twice in the original list, are now represented only once. Because sets are unordered, the items may be printed in a different order on your machine.
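
If a stable order is needed for display or testing, one option is to sort the set. A minimal sketch, reusing the sample data from above (the name unique_sorted is only illustrative):

```python
import json

json_data = '["iPhone 12", "Galaxy S20", "iPhone 12", "Galaxy S20", "Pixel 5"]'
phone_models = json.loads(json_data)

# sorted() accepts any iterable and returns a new list,
# so the result no longer depends on the set's internal order.
unique_sorted = sorted(set(phone_models))
print(unique_sorted)  # ['Galaxy S20', 'Pixel 5', 'iPhone 12']
```

Uppercase letters sort before lowercase ones by default, which is why 'iPhone 12' ends up last.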

Using List Comprehension

Assuming you have already loaded the JSON data into a Python list named phone_models (as shown in the previous section), you can build a set with a comprehension and convert it back into a list to filter out duplicates:

unique_models = list({model for model in phone_models})
print(unique_models)

Output:

['Galaxy S20', 'Pixel 5', 'iPhone 12']

In this code, the set comprehension {model for model in phone_models} removes the duplicates by building a set from the list.

Converting it back with list(...) gives a list of the unique values, although their order is not guaranteed.
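
If the original order matters, a true list comprehension can be combined with a helper set that tracks what has already been seen. This is a sketch of a common idiom rather than part of the method above; the names seen and unique_in_order are illustrative:

```python
phone_models = ["iPhone 12", "Galaxy S20", "iPhone 12", "Galaxy S20", "Pixel 5"]

seen = set()
# seen.add() returns None, so "seen.add(model)" is always falsy;
# the condition is true only the first time a model is encountered.
unique_in_order = [
    model for model in phone_models
    if not (model in seen or seen.add(model))
]
print(unique_in_order)  # ['iPhone 12', 'Galaxy S20', 'Pixel 5']
```

The side effect inside the condition is compact but easy to misread, so a plain loop may be the better choice in shared code.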

Using collections.OrderedDict

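If you need to keep the order in which values first appear, you can use OrderedDict.fromkeys(), which builds an ordered dictionary whose keys are the list items. Here’s how to do it: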

from collections import OrderedDict
unique_models_ordered = list(OrderedDict.fromkeys(phone_models))
print(unique_models_ordered)

Output:

['iPhone 12', 'Galaxy S20', 'Pixel 5']

In this code, OrderedDict.fromkeys(phone_models) creates an OrderedDict where each phone model from the phone_models list is a key.

Since keys in a dictionary are unique, this removes any duplicates.

The order in which elements are inserted is preserved. Converting it back to a list with list(...) provides a sequence of unique values in the order they first appeared in the original list.
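
On Python 3.7 and later, plain dictionaries also preserve insertion order, so the same idea should work without the import. A minimal sketch, assuming you don't need to support older interpreters:

```python
phone_models = ["iPhone 12", "Galaxy S20", "iPhone 12", "Galaxy S20", "Pixel 5"]

# dict.fromkeys() keeps the first occurrence of each key,
# and dicts remember insertion order in Python 3.7+.
unique_models_dict = list(dict.fromkeys(phone_models))
print(unique_models_dict)  # ['iPhone 12', 'Galaxy S20', 'Pixel 5']
```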

Using a Custom Function

Creating a custom function to extract unique values from a JSON array is useful when you want more control over the process, such as adding additional conditions for uniqueness.

Here’s the custom function:

def get_unique_values(data_list):
    unique_list = []
    for item in data_list:
        if item not in unique_list:
            unique_list.append(item)
    return unique_list
unique_models_custom = get_unique_values(phone_models)
print(unique_models_custom)

Output:

['iPhone 12', 'Galaxy S20', 'Pixel 5']

The function get_unique_values iterates through phone_models, adding each model to unique_list only if it is not already present.

As a result, unique_models_custom contains all unique phone models in their original order. Note that each item not in unique_list check scans the whole list, so this approach slows down as the number of unique values grows.
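
To illustrate the "additional conditions for uniqueness" mentioned above, here is one possible variation that accepts a key function and tracks seen keys in a set. The function name get_unique_by, its key parameter, and the mixed-case sample data are all hypothetical:

```python
def get_unique_by(data_list, key=lambda item: item):
    """Return items whose key has not been seen before, in original order."""
    seen = set()
    unique_list = []
    for item in data_list:
        marker = key(item)
        if marker not in seen:  # set lookup instead of a list scan
            seen.add(marker)
            unique_list.append(item)
    return unique_list

# Treat values that differ only in letter case as duplicates.
models = ["iPhone 12", "iphone 12", "Galaxy S20", "GALAXY S20", "Pixel 5"]
print(get_unique_by(models, key=str.lower))
# ['iPhone 12', 'Galaxy S20', 'Pixel 5']
```

The first spelling of each model is the one that is kept. The keys must be hashable for the set lookup to work.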

Using Pandas

First, you’ll need to install Pandas if you haven’t already:

pip install pandas

Then, import Pandas and use it to handle the JSON data:

import pandas as pd
df = pd.DataFrame(phone_models, columns=['Model'])
unique_models_pandas = df['Model'].drop_duplicates().tolist()
print(unique_models_pandas)

Output:

['iPhone 12', 'Galaxy S20', 'Pixel 5']

By converting the JSON array into a DataFrame, you can apply the drop_duplicates() method on the ‘Model’ column.

This method removes duplicate values, keeping the first occurrence of each by default, and the tolist() method converts the result back into a list.
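
When there is only one column of values, a Series may be enough. The sketch below uses Series.unique(), which returns the distinct values in order of first appearance; it is an alternative to the DataFrame approach, not a replacement for it:

```python
import pandas as pd

phone_models = ["iPhone 12", "Galaxy S20", "iPhone 12", "Galaxy S20", "Pixel 5"]

# unique() returns an array of distinct values; list() turns it into a plain list.
unique_models_series = list(pd.Series(phone_models).unique())
print(unique_models_series)  # ['iPhone 12', 'Galaxy S20', 'Pixel 5']
```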

Using Python itertools

We can use itertools.groupby() to group consecutive duplicate elements, and then extract one element from each group to get the unique values.

Here’s how you can use itertools along with a list comprehension:

import itertools

# groupby() only merges adjacent duplicates, so sort first.
# sorted() returns a new list and leaves phone_models unchanged.
sorted_models = sorted(phone_models)
unique_models_itertools = [model for model, _ in itertools.groupby(sorted_models)]
print(unique_models_itertools)

Output:

['Galaxy S20', 'Pixel 5', 'iPhone 12']

In this code, the groupby function groups the list by consecutive identical elements.

The list comprehension iterates over these groups and takes the key of each group, resulting in a list of unique values.

Note that this method will only remove consecutive duplicates. If the list is not sorted, identical items may not be adjacent and hence won’t be grouped. Sorting also means the original order of the items is lost.
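
The effect of skipping the sort can be seen in a small example. The list named readings is made-up sample data in which 'iPhone 12' appears in two separate runs:

```python
import itertools

readings = ["iPhone 12", "iPhone 12", "Galaxy S20", "iPhone 12", "Pixel 5", "Pixel 5"]

# Without sorting, only adjacent repeats are collapsed.
without_sort = [model for model, _ in itertools.groupby(readings)]
print(without_sort)  # ['iPhone 12', 'Galaxy S20', 'iPhone 12', 'Pixel 5']

# After sorting, equal items are adjacent and each appears once.
with_sort = [model for model, _ in itertools.groupby(sorted(readings))]
print(with_sort)  # ['Galaxy S20', 'Pixel 5', 'iPhone 12']
```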

Extracting Unique Values from Nested JSON Array

Imagine the JSON data now includes a list of phone models along with their features. The goal is to extract unique phone models from this nested structure.

import json
nested_json_data = '''
[
    {"model": "iPhone 12", "features": ["5G", "Dual Camera"]},
    {"model": "Galaxy S20", "features": ["AMOLED Display", "Water Resistant"]},
    {"model": "iPhone 12", "features": ["5G", "Dual Camera"]},
    {"model": "Pixel 5", "features": ["Night Sight", "Reverse Charging"]}
]
'''
data = json.loads(nested_json_data)
unique_models_nested = list({item['model'] for item in data})
print(unique_models_nested)

Output:

['Galaxy S20', 'iPhone 12', 'Pixel 5']

In this code, the set comprehension {item['model'] for item in data} goes through each dictionary in the loaded JSON array and extracts the ‘model’ value.

The set automatically removes any duplicates. Finally, converting it to a list with list(...) gives you a list of unique phone models.
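
If you want to keep the whole objects rather than just the model names, note that dictionaries are not hashable and cannot be placed in a set directly. One possible workaround, sketched below, is to use a canonical JSON string of each object as its identity; the names fingerprint and unique_records are illustrative:

```python
import json

nested_json_data = '''
[
    {"model": "iPhone 12", "features": ["5G", "Dual Camera"]},
    {"model": "Galaxy S20", "features": ["AMOLED Display", "Water Resistant"]},
    {"model": "iPhone 12", "features": ["5G", "Dual Camera"]},
    {"model": "Pixel 5", "features": ["Night Sight", "Reverse Charging"]}
]
'''
data = json.loads(nested_json_data)

seen = set()
unique_records = []
for item in data:
    # sort_keys=True makes the string independent of key order.
    fingerprint = json.dumps(item, sort_keys=True)
    if fingerprint not in seen:
        seen.add(fingerprint)
        unique_records.append(item)

print([item["model"] for item in unique_records])
# ['iPhone 12', 'Galaxy S20', 'Pixel 5']
```

Two objects count as duplicates here only if every field matches, including the order of items inside the features list.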

Benchmark Test

To perform a benchmark test, we’ll use Python’s timeit module, which provides a simple way to time small bits of Python code.

We’ll pass each statement to timeit.timeit() as a string and use globals=globals() so that the statement can see the data list defined in the script.

Let’s write a benchmark test for the different methods of extracting unique values:

import timeit
data = ["iPhone 12", "Galaxy S20", "iPhone 12", "Galaxy S20", "Pixel 5"] * 1000  # Large sample data

# Test using Python sets
time_sets = timeit.timeit('set(data)', globals=globals(), number=1000)

# Test using list comprehension
time_list_comp = timeit.timeit('list({model for model in data})', globals=globals(), number=1000)

# Test using OrderedDict
time_ordered_dict = timeit.timeit('list(OrderedDict.fromkeys(data))', globals=globals(), setup='from collections import OrderedDict', number=1000)

# Test using custom function
setup_custom_func = '''
def get_unique_values(data_list):
    unique_list = []
    for item in data_list:
        if item not in unique_list:
            unique_list.append(item)
    return unique_list
'''
time_custom_func = timeit.timeit('get_unique_values(data)', globals=globals(), setup=setup_custom_func, number=1000)

# Test using Pandas
time_pandas = timeit.timeit('pd.DataFrame(data, columns=["Model"]).drop_duplicates().Model.tolist()', globals=globals(), setup='import pandas as pd', number=1000)

# Test using itertools
time_itertools = timeit.timeit('[model for model, group in itertools.groupby(sorted(data))]', globals=globals(), setup='import itertools', number=1000)

print(f"Time using sets: {time_sets:.5f} seconds")
print(f"Time using list comprehension: {time_list_comp:.5f} seconds")
print(f"Time using OrderedDict: {time_ordered_dict:.5f} seconds")
print(f"Time using custom function: {time_custom_func:.5f} seconds")
print(f"Time using Pandas: {time_pandas:.5f} seconds")
print(f"Time using itertools: {time_itertools:.5f} seconds")

Output:

Time using sets: 0.05806 seconds
Time using list comprehension: 0.16215 seconds
Time using OrderedDict: 0.28626 seconds
Time using custom function: 0.29265 seconds
Time using Pandas: 1.11520 seconds
Time using itertools: 0.59240 seconds

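This script runs each statement 1000 times and prints the total time for those 1000 runs; timeit.timeit() returns the total elapsed time, not an average.

In this run, using sets is the fastest method. Keep in mind that the sample data contains only three distinct values, so the custom function’s list scan stays short; with many distinct values it slows down considerably. The set test also skips the final conversion to a list, which the other tests include.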

It’s important to note that the performance might vary based on the size and complexity of the data, and different methods may be preferable in different cases.
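
To see how the number of distinct values affects the results, you could rerun a comparison on data with more variety. The sketch below passes callables to timeit.repeat() and reports a per-call time; the generated data, the function names, and the run counts are arbitrary choices for illustration, and the timings will differ from machine to machine:

```python
import timeit

# Made-up data: 2000 items with 200 distinct values.
data = [f"model-{i % 200}" for i in range(2000)]

def with_set():
    return list(set(data))

def with_list_scan():
    unique_list = []
    for item in data:
        if item not in unique_list:
            unique_list.append(item)
    return unique_list

runs = 5
per_call = {}
for func in (with_set, with_list_scan):
    # repeat() returns one total per repetition; take the best of three.
    best_total = min(timeit.repeat(func, number=runs, repeat=3))
    per_call[func.__name__] = best_total / runs
    print(f"{func.__name__}: {per_call[func.__name__] * 1e6:.1f} microseconds per call")
```

Dividing the total by the number of runs gives the average time per call, which is easier to compare across tests that use different run counts.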
