11 Amazing NumPy Shuffle Examples

Python’s NumPy package offers various methods for operations involving randomness, such as randomly selecting one or more numbers from a given list, generating a random number in a given range, or drawing a random sample from a given distribution.

All these methods are offered under the random module of the NumPy package.
One such method is the numpy.random.shuffle method.
This method is used to randomly shuffle the elements of a given ‘mutable’ iterable.
The iterable must be mutable because the shuffling operation involves item reassignment, which immutable objects do not support.
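As a small illustration of the mutability requirement, the sketch below passes a tuple to np.random.shuffle and catches the resulting error. The variable names are made up for this example, and the exact error message may vary between NumPy versions.

```python
import numpy as np

values = (1, 2, 3, 4, 5)  # a tuple is subscriptable but immutable

try:
    np.random.shuffle(values)  # needs item assignment, which tuples lack
    failed = False
except TypeError as err:
    failed = True
    print(f"cannot shuffle a tuple in place: {err}")

print(f"values is unchanged: {values}")
```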

 

What are the benefits of shuffling?

The shuffling operation is fundamental to many applications where we want to introduce an element of chance while processing a given set of data.
It is particularly helpful in situations where we want to avoid introducing any bias through the order in which the data is processed.

The shuffling operation is commonly used in machine learning pipelines where data is processed in batches.
Before batches are drawn from the dataset, the data is shuffled so that each batch is a random selection.
It can also be used to randomly sample items from a given set without replacement.
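The batching idea can be sketched as follows. This is only a toy illustration: the dataset, the batch size, and the variable names are assumptions made for the example, not part of any particular machine learning library. Because the indices are shuffled once and then sliced, every sample lands in exactly one batch, which is sampling without replacement.

```python
import numpy as np

data = np.arange(10) * 10      # toy dataset of 10 samples (assumed)
batch_size = 4                 # hypothetical batch size

indices = np.arange(len(data))
np.random.shuffle(indices)     # new random order for this pass over the data

# Slice the shuffled indices into consecutive chunks; no index repeats.
batches = [data[indices[start:start + batch_size]]
           for start in range(0, len(data), batch_size)]

for batch in batches:
    print(batch)
```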

 

How to shuffle a NumPy array?

Let us look at the basic usage of the np.random.shuffle method.
We will shuffle a 1-dimensional NumPy array.

import numpy as np
for i in range(5):
    a=np.array([1,2,4,5,6])
    print(f"a = {a}")
    np.random.shuffle(a)
    print(f"shuffled a = {a}\n")

Output:

output of numpy shuffle 1D array

Each time we call the shuffle method, we generally get a different order of the array a.

Note that the output you get when you run this code may differ from the output I got because, as we discussed, shuffle is a random operation.
In a later section, we will learn how to make these random operations deterministic to make the results reproducible.

 

Shuffle multiple NumPy arrays together

We saw how to shuffle a single NumPy array. Sometimes we want to shuffle multiple same-length arrays together, and in the same order.
While the shuffle method cannot accept more than one array, we can achieve this by using another important method of the random module – np.random.permutation.

x = np.array([1,2,3,4,5,6])
y = np.array([10,20,30,40,50,60])
print(f"x = {x}, y = {y}")
shuffled_indices = np.random.permutation(len(x)) #return a permutation of the indices
print(f"shuffled indices: {shuffled_indices}")
x = x[shuffled_indices]
y = y[shuffled_indices]
print(f"shuffled x = {x}\nshuffled y = {y}")

Output:

output of shuffling multiple arrays together

We first generate a random permutation of the integers in the range [0, len(x)), and then use it to index both arrays, so that corresponding elements stay paired.
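One way to convince yourself that the pairs stay aligned is to build y from x and check the relationship again after shuffling. The sketch below does this with made-up arrays in which y[i] is always ten times x[i].

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = x * 10                               # y[i] is paired with x[i]

order = np.random.permutation(len(x))    # one shared random order
x_shuffled, y_shuffled = x[order], y[order]

print(f"shuffled x = {x_shuffled}\nshuffled y = {y_shuffled}")
# The pairing survives because both arrays were indexed with the same order.
print(np.array_equal(y_shuffled, x_shuffled * 10))
```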

If you are looking for a method that accepts multiple arrays together and shuffles them, then there exists one in the scikit-learn package – sklearn.utils.shuffle.

This method takes as many arrays as you want to shuffle and returns the shuffled arrays.

from sklearn.utils import shuffle
x = np.array([1,2,3,4,5,6])
y = np.array([10,20,30,40,50,60])
x_shuffled, y_shuffled = shuffle(x,y)
print(f"shuffled x = {x_shuffled}\nshuffled y={y_shuffled}")
print(f"original x = {x}, original y = {y}")

Output:

shuffling multiple arrays using sklearn

Note that this method does not perform in-place shuffling like np.random.shuffle does. Instead, it returns shuffled copies without modifying the input arrays.
Since this method does not involve in-place item reassignment, we can also shuffle immutable iterables using this method.

 

Shuffle 2D arrays

We have seen the effect of NumPy’s shuffle method on a 1-dimensional array.
Let us see what it does with 2D arrays.

x = np.random.randint(1,100, size=(3,3))
print(f"x:\n{x}\n")
np.random.shuffle(x)
print(f"shuffled x:\n{x}")

Output:

shuffling 2d array

If you look closely at the output, the order of the values in individual rows does not change; however, the positions of the rows in the array have been shuffled.
So the shuffle method shuffles the rows of a 2D array by default.

 

Shuffle columns of 2D NumPy array

We have seen in the last section the behavior of the shuffle method on a 2D array.
It shuffles the rows of the array in-place.

However, what do we do if we want to shuffle the columns of the array instead?
The shuffle method does not take any additional parameter to specify the axis along which we want to perform the shuffle.

So if we want to shuffle the columns of a 2D array using the np.random.shuffle method, we must find a way to treat the columns as rows or swap the columns with rows.
This is possible through the transpose operation.

We will perform the shuffle on a transposed version of the 2D array. Since the transpose is a view of the same data and the shuffling occurs in-place, this effectively shuffles the columns of the original array.

x = np.random.randint(1,100, size=(3,3))
print(f"x:\n{x}\n")
np.random.shuffle(x.T) #shuffling transposed form of x
print(f"column-wise shuffled x:\n{x}")

Output:

shuffling columns of 2d array

The columns of array x have been shuffled now, instead of the rows.
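As an aside that goes beyond np.random.shuffle, NumPy's newer Generator interface (created with np.random.default_rng) has its own shuffle method that, to my knowledge, accepts an axis argument in NumPy 1.18 and later. The sketch below assumes such a version is installed. It is an alternative to the transpose trick, not a replacement for it.

```python
import numpy as np

rng = np.random.default_rng()      # Generator interface, not the legacy np.random functions
x = np.arange(12).reshape(3, 4)
original = x.copy()

rng.shuffle(x, axis=1)             # shuffle columns in place (axis support assumed)
print(f"original:\n{original}\n")
print(f"column-wise shuffled:\n{x}")
```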

 

Shuffle multidimensional NumPy arrays

We have seen the behavior of the shuffle method on 1- and 2-dimensional arrays. Let us now try to understand what happens if we pass a higher-dimensional array to this method.

Let us pass a 3-dimensional array to the np.random.shuffle method.

x = np.random.randint(1,100, size=(4,3,3))
print(f"x:\n{x}\n")
np.random.shuffle(x) 
print(f"shuffled x:\n{x}")

Output:

shuffling multidimensional arrays

Here the positions of the individual 3×3 arrays have been shuffled.

This behavior is similar to what we observed with 2-dimensional arrays.
The shuffle method, by default, shuffles any higher-dimensional array along the first dimension, i.e., axis=0.

Shuffle along axis

If we want the array to be shuffled along any other axis, we can use the technique we used earlier to shuffle multiple arrays together.
We can generate a random permutation of the indices along that axis, and use it to index the array.

Let’s shuffle the 4x3x3 array along axes 1 and 2.

x = np.random.randint(1,100, size=(4,3,3))
print(f"x:\n{x}, shape={x.shape}\n")
indices_1 = np.random.permutation(x.shape[1])
x_1 = x[:,indices_1,:]
print(f"shuffled x along axis=1:\n{x_1}, shape={x_1.shape}\n")
indices_2 = np.random.permutation(x.shape[2])
x_2 = x[:,:,indices_2]
print(f"shuffled x along axis=2:\n{x_2}, shape={x_2.shape}\n")

Output:

shuffling multidimensional arrays along different dimensions

In the first output, when we shuffle along axis=1, the rows of each 3×3 array have been shuffled.
Similarly, when we shuffle along axis=2, the columns of the arrays have been shuffled.
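The same index-permutation idea can be wrapped in a small helper that works for any axis. The function name shuffle_along_axis is hypothetical (it is not a NumPy function). It relies on np.take, which selects entries along a given axis and returns a new array.

```python
import numpy as np

def shuffle_along_axis(arr, axis):
    """Return a copy of arr with its slices reordered along `axis`."""
    order = np.random.permutation(arr.shape[axis])  # random order of indices
    return np.take(arr, order, axis=axis)           # apply it along that axis

x = np.arange(36).reshape(4, 3, 3)
x_1 = shuffle_along_axis(x, axis=1)
x_2 = shuffle_along_axis(x, axis=2)
print(f"shuffled along axis=1:\n{x_1}, shape={x_1.shape}\n")
print(f"shuffled along axis=2:\n{x_2}, shape={x_2.shape}")
```

Unlike np.random.shuffle, this helper leaves the input array untouched and returns a shuffled copy.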

 

Shuffle a list

In an earlier section, we discussed that one condition for the np.random.shuffle method to work is that the input must be a mutable object, since the method involves in-place item reassignment.
Another condition for any shuffle method to work is that the input object must be subscriptable by position. That means the individual elements of the input can be identified and accessed using their positions or indices.

Among the basic data structures offered by Python, the list is the only one that satisfies both these conditions.
Sets are mutable but not subscriptable, and dictionaries are mutable but indexed by key rather than by position.
Tuples and strings are subscriptable but not mutable.

Let us shuffle a Python list using the np.random.shuffle method.

a = [5.4, 10.2, "hello", 9.8, 12, "world"]
print(f"a = {a}")
np.random.shuffle(a)
print(f"shuffled a = {a}")

Output:

shuffling Python list

If we want to shuffle a string or a tuple, we can either first convert it to a list, shuffle it, and then convert it back to a string or tuple,
or we can use scikit-learn’s shuffle method to get a shuffled copy of it.
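The convert, shuffle, and convert-back route can be sketched like this. The sample string and tuple are made up for the illustration.

```python
import numpy as np

word = "numpy"
letters = list(word)              # strings are immutable, so copy into a list
np.random.shuffle(letters)        # shuffle the list in place
shuffled_word = "".join(letters)  # convert back to a string

point = (3, 1, 4, 1, 5)
items = list(point)               # same idea for a tuple
np.random.shuffle(items)
shuffled_point = tuple(items)

print(f"{word} -> {shuffled_word}")
print(f"{point} -> {shuffled_point}")
```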

 

Shuffle with seed

If you have been implementing the code snippets while following this blog, you must have noticed that the results you get after performing a shuffle operation differ from the results shown in the output here.
This is because shuffle is a random operation, so the results differ from run to run.

The random operations in programming languages are not truly random. They rely on a pseudo-random number generator, which produces a sequence of numbers by repeatedly applying deterministic mathematical operations, starting from a number called the ‘seed’.
If we fix the value of the seed before performing a random operation, or even a series of random operations, the final output becomes deterministic and can be reproduced every time using the same seed.

Let us go back to the first shuffle operation we performed in this blog.
We shuffled a NumPy array five times in a row using a for loop, and each time we got a different output.
Let us now set a fixed random seed each time before shuffling the original array, and see if we get the same output or a different one.

for i in range(5):
    a=np.array([1,2,4,5,6])
    print(f"a = {a}")
    np.random.seed(42) #setting the random seed
    np.random.shuffle(a)
    print(f"shuffled a = {a}\n")

Output:

shuffle with seed

We are setting a random seed using np.random.seed() before each call to np.random.shuffle to make the shuffle operation deterministic.
However, it is not necessary to set the random seed before every call to a random operation.
If you set the random seed once before performing a series of random operations at different points in your code, you can reproduce the output of the code later, on another day or a different machine, by setting the same seed at the beginning of the code.
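To illustrate setting the seed only once, the sketch below wraps a short series of random operations in a function (run_experiment is a hypothetical name) and runs it twice with the same seed. Both runs should produce identical results.

```python
import numpy as np

def run_experiment(seed):
    np.random.seed(seed)                # set the seed once, up front
    a = np.arange(10)
    np.random.shuffle(a)                # first random operation
    picks = np.random.permutation(5)    # second random operation
    return a.tolist(), picks.tolist()

first_run = run_experiment(42)
second_run = run_experiment(42)
print(first_run)
print(second_run)
print(f"identical: {first_run == second_run}")
```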

 

Shuffle dimensions of NumPy array

We have so far performed shuffle operations on NumPy arrays without affecting the shape of the arrays.
We have been shuffling the content of the array along a chosen axis.

However, what if we want to shuffle the axes of the arrays instead of their elements?
NumPy arrays have a method called transpose, which accepts a tuple of axis indices and rearranges the axes of the array in the order passed.

Let us build a 4-dimensional array of shape (2,3,2,4) and then shuffle its dimensions.

np.random.seed(0)
a = np.arange(48).reshape(2,3,2,4)
print(f"array a:\n{a}, shape={a.shape}\n")
shuffled_dimensions = np.random.permutation(a.ndim)
print(f"shuffled dimensions = {shuffled_dimensions}\n")
a_shuffled = a.transpose(shuffled_dimensions)
print(f"array a with shuffled dimensions:\n{a_shuffled}, shape={a_shuffled.shape}")

Output:

shuffling array dimensions

The original array was of the shape (2,3,2,4).
After we shuffled its dimensions, it was transformed into the shape (2,4,3,2).
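The resulting shape can be predicted from the permutation itself: axis i of the transposed array has the length of the original axis named at position i of the permutation. The sketch below checks this without fixing a seed, so the permutation you see will vary. The name expected_shape is introduced only for this example.

```python
import numpy as np

a = np.arange(48).reshape(2, 3, 2, 4)
order = np.random.permutation(a.ndim)     # random order of the axes
a_shuffled = a.transpose(order)

# New axis i takes its length from original axis order[i].
expected_shape = tuple(a.shape[axis] for axis in order)
print(f"order = {order}, shape = {a_shuffled.shape}, expected = {expected_shape}")

# transpose returns a view, so no element data is copied.
print(np.shares_memory(a, a_shuffled))
```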

 

Shuffle vs permutation

We have seen in multiple sections of this blog how NumPy’s permutation method can be used to perform the shuffling operation.
Not only does np.random.permutation help in shuffling arrays in ways that np.random.shuffle cannot,
but it can also achieve the same results that shuffle produces on lists and arrays.

In this section, we will learn the various similarities and differences between the two methods.

Let us first talk about the type of input that the two methods can accept.
The shuffle method strictly accepts only subscriptable, mutable iterables. The permutation method, on the other hand, accepts mutable iterables, immutable iterables, and integers.
When you pass an integer n to np.random.permutation, it returns a permutation of the integers from 0 up to, but not including, n.

np.random.seed(42)
print(np.random.permutation(10))

Output:

permutation method with integer value

Next, let us talk about how the two methods perform the shuffling operation.
The shuffle method performs in-place shuffling on the original iterable that we pass to it, so the original iterable is modified and the method returns None.
On the other hand, permutation shuffles a copy of its input and always returns a NumPy array regardless of the input type, so it does not modify the original iterable.
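The difference in return values and side effects can be seen in a short sketch. The tuple and array used here are made up for the illustration.

```python
import numpy as np

original = (10, 20, 30, 40)                 # immutable input
result = np.random.permutation(original)    # shuffles a copy, returns an array
print(type(result).__name__, result)
print(f"original is untouched: {original}")

arr = np.array([10, 20, 30, 40])
returned = np.random.shuffle(arr)           # shuffles arr itself
print(f"shuffle returned {returned}, arr is now {arr}")
```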

Let us also compare the time it takes for the two methods to shuffle the same array.
We will run the two methods on the same array and log the time it takes for them to shuffle it.
We will log times for arrays of lengths ranging from 10^2 to 10^9, increasing by a factor of 10 each time.

import numpy as np
import time
permutation_time_log = []
shuffle_time_log = []
for i in range(2,10):
    print(f"shuffling array of length 10^{i}")
    a = np.random.randint(100, size=(10**i))
    t1 = time.time()
    np.random.permutation(a)
    t2 = time.time()
    permutation_time_log.append(t2-t1)
    t1 = time.time()
    np.random.shuffle(a)
    t2 = time.time()
    shuffle_time_log.append(t2-t1)
    del a

Note that we are deleting the array at the end of every loop to free up memory; this avoids any memory overhead during later iterations.

We have logged the time consumed by the two methods for arrays of increasing lengths.
Let us now plot them using pyplot.

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
ax  = fig.add_subplot(111)
ax.plot(permutation_time_log, label="permutation")
ax.plot(shuffle_time_log, label="shuffle")
ax.set_xlabel("length of array")
ax.set_ylabel("time for shuffling(s)")
ax.set_xticks(range(8))
ax.set_xticklabels([f"10^{i}" for i in range(2,10)])
ax.legend()
plt.show()

Output:

time comparison of shuffle and permutation

It is evident from the figure that the two methods take almost the same time for arrays up to length 10^8,
and the difference between their times becomes more prominent beyond this point.
For arrays of lengths higher than 10^8, the shuffle method performs shuffling faster than permutation,
and its advantage over the latter becomes more significant with increasing lengths.
