Filter NumPy Array by Mask Array in Python

In this tutorial, you’ll learn how to filter a NumPy array with a Boolean mask array. We’ll apply masks to one-dimensional (1D) arrays, two-dimensional (2D) arrays, and higher-dimensional data.

We’ll also explore the concept of broadcasting masks to apply a single condition across multiple dimensions of an array.

 

 

1D Mask for 1D Array

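You can use a Boolean mask array to filter a 1D array. The mask must have the same length as the array, and only the elements at True positions are kept.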

First, import NumPy and initialize your data array:

import numpy as np
data_array = np.array([3.2, 4.5, 5.7, 2.2, 6.1, 3.3, 7.4, 1.8, 5.9])

Now, define a Boolean mask array with one entry per element of the data array:

mask_array = np.array([True, False, True, False, True, False, True, False, True])

With the mask array defined, apply it to the data array to filter the values:

filtered_data = data_array[mask_array]
print(filtered_data)

Output:

[3.2 5.7 6.1 7.4 5.9]

In this output, filtered_data contains elements from data_array corresponding to the True positions in mask_array.
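In practice, a mask is usually computed from a condition rather than typed by hand. Here is a minimal sketch that reuses data_array from above; the threshold of 4.0 and the names condition_mask and above_threshold are illustrative assumptions, not part of the original example:

```python
import numpy as np

data_array = np.array([3.2, 4.5, 5.7, 2.2, 6.1, 3.3, 7.4, 1.8, 5.9])

# Comparing an array with a scalar yields a Boolean array of the same shape.
# The 4.0 threshold is an arbitrary value chosen for illustration.
condition_mask = data_array > 4.0

# A Boolean mask must match the length of the axis it indexes.
assert condition_mask.shape == data_array.shape

above_threshold = data_array[condition_mask]
print(above_threshold)  # [4.5 5.7 6.1 7.4 5.9]
```

Building the mask from a comparison keeps it in sync with the data, so you never have to count True and False entries by hand.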

 

1D Mask for 2D Array (Row or Column Filtering)

Applying a 1D mask along one axis of a 2D array lets you extract entire rows or columns based on a condition evaluated along that axis.

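First, create a sample 2D array in which each row is a user and the columns hold data usage (GB), call duration (minutes), and customer rating: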

import numpy as np
user_metrics = np.array([
    [2.5, 180, 4.1],
    [5.1, 220, 3.8],
    [3.3, 140, 4.5],
    [7.8, 200, 3.2],
    [1.2, 80,  4.8],
    [4.5, 160, 3.9],
    [6.9, 240, 3.5]
])

Suppose you want to filter the rows where data usage is more than 4 GB. First, create a mask from the first column (data usage):

mask = user_metrics[:, 0] > 4  # Mask for data usage greater than 4 GB
print(mask)

Output:

[False  True False  True False  True  True]

The mask array contains True for rows where the condition (data usage > 4 GB) is met.

Now, apply this mask to filter the rows:

filtered_rows = user_metrics[mask]
print(filtered_rows)

Output:

[[  5.1 220.    3.8]
 [  7.8 200.    3.2]
 [  4.5 160.    3.9]
 [  6.9 240.    3.5]]

The filtered_rows array now includes only those rows from user_metrics where the data usage was more than 4 GB.

Similarly, you can modify this approach to filter columns, although that’s less common since columns usually represent different types of data.
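To sketch what column filtering could look like, the example below builds a mask with one entry per column and applies it on the second axis. The condition (column mean below 10) is a made-up criterion used only for illustration, and the names column_mask and filtered_columns are assumptions:

```python
import numpy as np

user_metrics = np.array([
    [2.5, 180, 4.1],
    [5.1, 220, 3.8],
    [3.3, 140, 4.5],
    [7.8, 200, 3.2],
    [1.2, 80,  4.8],
    [4.5, 160, 3.9],
    [6.9, 240, 3.5]
])

# Hypothetical condition: keep columns whose mean value is below 10.
# axis=0 averages down the rows, giving one value per column.
column_mask = user_metrics.mean(axis=0) < 10
print(column_mask)  # [ True False  True]

# The colon keeps every row; the mask selects columns.
filtered_columns = user_metrics[:, column_mask]
print(filtered_columns.shape)  # (7, 2)
```

The only difference from row filtering is the position of the mask in the index: user_metrics[mask] filters rows, while user_metrics[:, mask] filters columns.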

 

Combining Multiple Masks

Let’s build on the previous examples and see how to combine multiple masks for more sophisticated data filtering.

First, recall our 2D array of user metrics:

import numpy as np
user_metrics = np.array([
    [2.5, 180, 4.1],  # Data in GB, Call duration in minutes, Customer rating
    [5.1, 220, 3.8],
    [3.3, 140, 4.5],
    [7.8, 200, 3.2],
    [1.2, 80,  4.8],
    [4.5, 160, 3.9],
    [6.9, 240, 3.5]
])

Suppose you want to filter users who have used more than 4 GB of data and have a customer rating above 3.5. Create two masks and then combine them:

mask1 = user_metrics[:, 0] > 4  # Data usage > 4 GB
mask2 = user_metrics[:, 2] > 3.5  # Customer rating > 3.5
combined_mask = mask1 & mask2
print(combined_mask)

Output:

[False  True False False False  True False]

The combined_mask is the element-wise logical AND (&) of mask1 and mask2, so it is True only in rows where both conditions are met.

Now, apply this combined mask to filter the array:

filtered_data = user_metrics[combined_mask]
print(filtered_data)

Output:

[[  5.1 220.    3.8]
 [  4.5 160.    3.9]]

You can also use other logical operations like OR (|) and NOT (~) to create more diverse combinations of masks.
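The following sketch shows OR and NOT on the same user_metrics array. The thresholds (6 GB and a rating of 4.4) and the names either_mask and not_mask are illustrative assumptions. Note the parentheses around each comparison: in Python, | and & bind more tightly than > and <, so omitting them changes the meaning of the expression.

```python
import numpy as np

user_metrics = np.array([
    [2.5, 180, 4.1],  # Data in GB, Call duration in minutes, Customer rating
    [5.1, 220, 3.8],
    [3.3, 140, 4.5],
    [7.8, 200, 3.2],
    [1.2, 80,  4.8],
    [4.5, 160, 3.9],
    [6.9, 240, 3.5]
])

# OR: data usage above 6 GB, or customer rating above 4.4 (illustrative thresholds).
# Each comparison is parenthesized because | has higher precedence than >.
either_mask = (user_metrics[:, 0] > 6) | (user_metrics[:, 2] > 4.4)
print(user_metrics[either_mask])

# NOT: invert a mask to keep the rows where data usage is NOT above 4 GB.
not_mask = ~(user_metrics[:, 0] > 4)
print(user_metrics[not_mask])
```

Use the operators &, | and ~ rather than the keywords and, or and not: the keywords try to reduce a whole array to a single truth value and raise a ValueError for arrays with more than one element.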

 

Masking in Higher-Dimensional Arrays

Imagine a scenario where a 3D array represents data usage, call duration, and customer ratings across different cities and time periods.

First, let’s create a sample 3D array:

import numpy as np

# Sample 3D array: dimensions might represent [City, Time Period, Metric]
telecom_data = np.array([
    [[2.5, 180, 4.1], [5.1, 220, 3.8], [3.3, 140, 4.5]],
    [[3.2, 150, 4.0], [7.8, 200, 3.2], [1.2, 80, 4.8]],
    [[4.5, 160, 3.9], [6.9, 240, 3.5], [2.1, 110, 4.2]]
])

In this array, let’s assume the first dimension is different cities, the second dimension is time periods, and the third dimension is various metrics.

To apply masking to this array, you first need to define a condition. Let’s say you want to identify where data usage exceeds 4 GB:

mask = telecom_data[:, :, 0] > 4  # Mask for data usage > 4 GB
print(mask)

Output:

[[False  True False]
 [False  True False]
 [ True  True False]]

This mask is a 2D array representing the condition applied across cities and time periods for the data usage metric.

Apply this mask to the 3D array:

filtered_data = telecom_data[mask]
print(filtered_data)

Output:

[[  5.1 220.    3.8]
 [  7.8 200.    3.2]
 [  4.5 160.    3.9]
 [  6.9 240.    3.5]]

In filtered_data, each row is the full set of metrics for one city and time period where data usage exceeds 4 GB.

Note that the result is no longer 3D. The 2D mask indexes the first two dimensions (city and time period) together, so they collapse into a single axis and the result has shape (4, 3).
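Because the filtered result loses the city and time period axes, it can help to recover which positions were selected, or to keep the original shape. The sketch below shows one possible approach using np.argwhere and np.where; the names positions and masked_keep_shape, and the choice of NaN as the fill value, are assumptions made for illustration:

```python
import numpy as np

telecom_data = np.array([
    [[2.5, 180, 4.1], [5.1, 220, 3.8], [3.3, 140, 4.5]],
    [[3.2, 150, 4.0], [7.8, 200, 3.2], [1.2, 80, 4.8]],
    [[4.5, 160, 3.9], [6.9, 240, 3.5], [2.1, 110, 4.2]]
])

mask = telecom_data[:, :, 0] > 4  # 2D mask over (city, time period)

# np.argwhere returns one (city, time period) index pair per True entry:
# here (0, 1), (1, 1), (2, 0) and (2, 1), in the same order as the filtered rows.
positions = np.argwhere(mask)
print(positions)

# To keep the 3D shape, add a trailing axis to the mask so it broadcasts
# over the metric axis, and replace unselected entries with NaN.
masked_keep_shape = np.where(mask[:, :, np.newaxis], telecom_data, np.nan)
print(masked_keep_shape.shape)  # (3, 3, 3)
```

Row i of telecom_data[mask] corresponds to row i of positions, so the two together tell you both the metrics and where they came from.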

 

Broadcasting Masks

Broadcasting in NumPy allows you to apply element-wise operations to arrays of different but compatible shapes.

This concept extends to the use of masks as well, enabling you to apply a mask across an entire array, even if their dimensions don’t exactly match.

Consider a case where you have a 2D array representing various user metrics over time, and you want to apply a condition across all these metrics uniformly.

First, let’s create a sample 2D array:

import numpy as np
user_metrics = np.array([
    [2.5, 5.1, 3.3, 7.8, 1.2],
    [180, 220, 140, 200, 80],
    [4.1, 3.8, 4.5, 3.2, 4.8]
])

Suppose you have a 1D mask with one entry per column that you want to apply to every row. For instance, a mask that marks the columns where the value in the first row is greater than 4:

mask = np.array([False, True, False, True, False])  # Mask to be broadcasted

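Apply this mask along the column axis so that the same columns are selected in every row:

broadcasted_mask = mask[np.newaxis, :]  # Shape (1, 5): adds a leading axis
filtered_metrics = user_metrics[:, broadcasted_mask[0]]  # Row 0 is the original 1D mask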
print(filtered_metrics)

Output:

[[  5.1   7.8]
 [220.  200. ]
 [  3.8   3.2]]

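In this example, broadcasted_mask is a 2D version of the original mask with shape (1, 5). Adding the new axis does not copy the mask to match the (3, 5) shape of user_metrics, and indexing with broadcasted_mask[0] simply recovers the original 1D mask, so the result is the same as user_metrics[:, mask].

Because the mask is placed on the column axis, it is applied to all rows, keeping only the columns where the mask is True.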
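If you do need a mask with the full shape of the array, one option is np.broadcast_to, and np.where broadcasts its condition for you. The sketch below is an illustration under those assumptions; the names full_mask, selected and kept, and the NaN fill value, are not part of the original example:

```python
import numpy as np

user_metrics = np.array([
    [2.5, 5.1, 3.3, 7.8, 1.2],
    [180, 220, 140, 200, 80],
    [4.1, 3.8, 4.5, 3.2, 4.8]
])
mask = np.array([False, True, False, True, False])

# Expand the (5,) mask to a read-only (3, 5) view without copying data.
full_mask = np.broadcast_to(mask, user_metrics.shape)
print(full_mask.shape)  # (3, 5)

# Indexing with a full-shape mask returns a flat 1D array of selected values.
selected = user_metrics[full_mask]
print(selected.shape)  # (6,)

# np.where broadcasts the 1D mask itself and preserves the 2D shape,
# filling unselected positions with NaN.
kept = np.where(mask, user_metrics, np.nan)
print(kept.shape)  # (3, 5)
```

Which form to use depends on the shape you want back: user_metrics[:, mask] keeps the rows intact, a full-shape mask flattens the selection, and np.where keeps the original shape with placeholders.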
