Shuffling Non-zero Elements Of Each Row In An Array - Python / NumPy

August 28, 2022 Post a Comment

I have a an array that is relatively sparse, and I would like to go through each row and shuffle only the non-zero elements. Example Input: [2,3,1,0] [0,0,2,1] Example Output: [2,

Solution 1:

You could use the non-inplace numpy.random.permutation with explicit non-zero indexing:

>>> X = np.array([[2,3,1,0], [0,0,2,1]])
>>> for i in range(len(X)):
...     idx = np.nonzero(X[i])
...     X[i][idx] = np.random.permutation(X[i][idx])
... 
>>> X
array([[3, 2, 1, 0],
       [0, 0, 2, 1]])

Solution 2:

I think I found the three-liner?

i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]

Solution 3:

As promised, this being the fourth day of the bounty period, here's my attempt at a vectorized solution. The steps involved are explained in some details below :

For easy reference, let's call the input array as a. Generate unique indices per row that covers the range for row length. For this, we can simply generate random numbers of the same shape as the input array and get the argsort indices along each row, which would be those unique indices. This idea has been explored before in this post.
Index into each row of input array with those indices as columns indices. Thus, we would need advanced-indexing here. Now, this gives us an array with each row being shuffled. Let's call it b.
Since the shuffling is restricted to per row, if we simply use the boolean-indexing : b[b!=0], we would get the non-zero elements being shuffled and also being restricted to lengths of non-zeros per row. This is because of the fact that the elements in a NumPy array are stored in row-major order, so with boolean-indexing it would have selected shuffled non-zero elements on each row first before moving onto the next row. Again, if we use boolean-indexing similarly for a, i.e. a[a!=0], we would have similarly gotten the non-zero elements on each row first before moving onto the next row and these would be in their original order. So, the final step would be to just grab masked elements b[b!=0] and assign into the masked places a[a!=0].

Thus, an implementation covering the above mentioned three steps would be -

m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1) #step1
b = a[np.arange(m)[:,None], rand_idx]          #step2  
a[a!=0] = b[b!=0]                              #step3

A sample step-by-step run might make things clearer -

In [50]: a # Input array
Out[50]: 
array([[ 8,  5,  0, -4],
       [ 0,  6,  0,  3],
       [ 8,  5,  0, -4]])

In [51]: m,n = a.shape # Store shape information

# Unique indices per row that covers the range for row length
In [52]: rand_idx = np.random.rand(m,n).argsort(axis=1)

In [53]: rand_idx
Out[53]: 
array([[0, 2, 3, 1],
       [1, 0, 3, 2],
       [2, 3, 0, 1]])

# Get corresponding indexed array
In [54]: b = a[np.arange(m)[:,None], rand_idx]

# Do a check on the shuffling being restricted to per row
In [55]: a[a!=0]
Out[55]: array([ 8,  5, -4,  6,  3,  8,  5, -4])

In [56]: b[b!=0]
Out[56]: array([ 8, -4,  5,  6,  3, -4,  8,  5])

# Finally do the assignment based on masking on a and b
In [57]: a[a!=0] = b[b!=0]

In [58]: a # Final verification on desired result
Out[58]: 
array([[ 8, -4,  0,  5],
       [ 0,  6,  0,  3],
       [-4,  8,  0,  5]])

Solution 4:

Benchmarking for the vectorized solutions

We are looking to benchmark vectorized solutions in this post. Now, the vectorization tries to avoid the looping that we would loop through each row and do the shuffling. So, the setup for the input array involves a greater number of rows.

Approaches -

def app1(a): # @Daniel F's soln
    i, j = np.nonzero(a.astype(bool))
    k = np.argsort(i + np.random.rand(i.size))
    a[i,j] = a[i,j[k]]
    return a

def app2(x): # @kazemakase's soln
    r, c = np.where(x != 0)
    n = c.size
    perm = np.random.permutation(n)
    i = np.argsort(perm + r * n)
    x[r, c] = x[r, c[i]]
    return x

def app3(a): # @Divakar's soln
    m,n = a.shape
    rand_idx = np.random.rand(m,n).argsort(axis=1)
    b = a[np.arange(m)[:,None], rand_idx]
    a[a!=0] = b[b!=0]
    return a

from scipy.ndimage.measurements import labeled_comprehension
def app4(a): # @FabienP's soln
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1),\
                                func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Benchmarking procedure #1

For a fair benchmarking, it seemed reasonable to use OP's sample and simply stack those as more rows to get a bigger dataset. Thus, with that setup we could create two cases with 2 million and 20 million rows datasets.

Case #1 : Large dataset with 2*1000,000 rows

In [174]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [175]: a = np.vstack([a]*1000000)

In [176]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: %timeit app4(a)
     ...: 
1 loop, best of 3: 264 ms per loop
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 254 ms per loop
1 loop, best of 3: 14.3 s per loop

Case #2 : Larger dataset with 2*10,000,000 rows

In [177]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [178]: a = np.vstack([a]*10000000)

# app4 skipped here as it was slower on the previous smaller dataset
In [179]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: 
1 loop, best of 3: 2.86 s per loop
1 loop, best of 3: 4.62 s per loop
1 loop, best of 3: 2.55 s per loop

Benchmarking procedure #2 : Extensive one

To cover all cases of varying percentage of non-zeros in the input array, we are covering some extensive benchmarking scenarios. Also, since app4 seemed much slower than others, for a closer inspection we are skipping that one in this section.

Helper function to setup input array :

def in_data(n_col, nnz_ratio):
    # max no. of elems that my system can handle, i.e. stretching it to limits.
    # The idea is to use this to decide the number of rows and always use
    # max. possible dataset size
    num_elem = 10000000

    n_row = num_elem//n_col
    a = np.zeros((n_row, n_col),dtype=int)
    L = int(round(a.size*nnz_ratio))
    a.ravel()[np.random.choice(a.size, L, replace=0)] = np.random.randint(1,6,L)
    return a

Main timing and plotting script (Uses IPython magic functions. So, needs to be run opon copying and pasting onto IPython console) -

import matplotlib.pyplot as plt

# Setup input params
nnz_ratios = np.array([0.2, 0.4, 0.6, 0.8])
n_cols = np.array([4, 5, 8, 10, 15, 20, 25, 50])

init_arr1 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr2 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr3 = np.zeros((len(nnz_ratios), len(n_cols) ))

timings = {app1:init_arr1, app2:init_arr2, app3:init_arr3}
for i,nnz_ratio in enumerate(nnz_ratios):
    for j,n_col in enumerate(n_cols):
        a = in_data(n_col, nnz_ratio=nnz_ratio)
        for func in timings:
            res = %timeit -oq func(a)
            timings[func][i,j] = res.best
            print func.__name__, i, j, res.best

fig = plt.figure(1)
colors = ['b','k','r']
for i in range(len(nnz_ratios)):
    ax = plt.subplot(2,2,i+1)
    for f,func in enumerate(timings):
        ax.plot(n_cols, 
                [time for time in timings[func][i]], 
                label=str(func.__name__), color=colors[f])
    ax.set_xlabel('No. of cols')
    ax.set_ylabel('time [seconds]')
    ax.grid(which='both')
    ax.legend()
    plt.tight_layout()
    plt.title('Percentage non-zeros : '+str(int(100*nnz_ratios[i])) + '%')
plt.subplots_adjust(wspace=0.2, hspace=0.2)

Timings output -

Observations :

Approaches #1, #2 does argsort on the non-zero elements across the entire input array. As such, it performs better with lesser percentage of non-zeros.
Approach #3 creates random numbers of the same shape as the input array and then gets argsort indices per row. Thus, for a given number of non-zeros in the input, the timings for it are more steep-ish than first two approaches.

Conclusion :

Approach #1 seems to be doing pretty well until 60% non-zero mark. For more non-zeros and if the row-lengths are small, approach #3 seems to be performing decently.

Solution 5:

I came up with that:

nz = a.nonzero()                      # Get nonzero indexes
a[nz] = np.random.permutation(a[nz])  # Shuffle nonzero values with mask

Which look simpler (and a little bit faster?) than other proposed solutions.

EDIT: new version that does not mix rows

 labels, *idx = nz = a.nonzero()                                    # get masks
 a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i])  # permute values
                         for i in np.unique(labels)])               # for each label

Where the first array of a.nonzero() (indexes of non zero values in axis0) is used as labels. This is the trick that does not mix rows.

Then np.random.permutation is applied on a[a.nonzero()] for each "label" independently.

Supposedly scipy.ndimage.measurements.labeled_comprehension can be used here, by it seems to fail with np.random.permutation.

And I finally saw that it looks a lot like what @randomir proposed. Anyway, it was just for the challenge of getting it to work.

EDIT2:

Finally got it working with scipy.ndimage.measurements.labeled_comprehension

def shuffle_rows(a):
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, *idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1), func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Where:

func() shuffles the non zero values
labeled_comprehension applies func() label-wise

This replaces the previous for loop and will be faster on arrays with many rows.

Python Programming Language