Shuffling Non-zero Elements Of Each Row In An Array - Python / NumPy
Solution 1:
You could use the non-inplace numpy.random.permutation
with explicit non-zero indexing:
>>> X = np.array([[2,3,1,0], [0,0,2,1]])
>>> for i in range(len(X)):
... idx = np.nonzero(X[i])
... X[i][idx] = np.random.permutation(X[i][idx])
...
>>> X
array([[3, 2, 1, 0],
[0, 0, 2, 1]])
Solution 2:
I think I found the three-liner?
i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]
Solution 3:
As promised, this being the fourth day of the bounty period, here's my attempt at a vectorized solution. The steps involved are explained in some details below :
For easy reference, let's call the input array as
a
. Generate unique indices per row that covers the range for row length. For this, we can simply generate random numbers of the same shape as the input array and get theargsort
indices along each row, which would be those unique indices. This idea has been explored before inthis post
.Index into each row of input array with those indices as columns indices. Thus, we would need
advanced-indexing
here. Now, this gives us an array with each row being shuffled. Let's call itb
.Since the shuffling is restricted to per row, if we simply use the boolean-indexing :
b[b!=0]
, we would get the non-zero elements being shuffled and also being restricted to lengths of non-zeros per row. This is because of the fact that the elements in a NumPy array are stored in row-major order, so with boolean-indexing it would have selected shuffled non-zero elements on each row first before moving onto the next row. Again, if we use boolean-indexing similarly fora
, i.e.a[a!=0]
, we would have similarly gotten the non-zero elements on each row first before moving onto the next row and these would be in their original order. So, the final step would be to just grab masked elementsb[b!=0]
and assign into the masked placesa[a!=0]
.
Thus, an implementation covering the above mentioned three steps would be -
m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1) #step1
b = a[np.arange(m)[:,None], rand_idx] #step2
a[a!=0] = b[b!=0] #step3
A sample step-by-step run might make things clearer -
In [50]: a # Input array
Out[50]:
array([[ 8, 5, 0, -4],
[ 0, 6, 0, 3],
[ 8, 5, 0, -4]])
In [51]: m,n = a.shape # Store shape information
# Unique indices per row that covers the range for row length
In [52]: rand_idx = np.random.rand(m,n).argsort(axis=1)
In [53]: rand_idx
Out[53]:
array([[0, 2, 3, 1],
[1, 0, 3, 2],
[2, 3, 0, 1]])
# Get corresponding indexed array
In [54]: b = a[np.arange(m)[:,None], rand_idx]
# Do a check on the shuffling being restricted to per row
In [55]: a[a!=0]
Out[55]: array([ 8, 5, -4, 6, 3, 8, 5, -4])
In [56]: b[b!=0]
Out[56]: array([ 8, -4, 5, 6, 3, -4, 8, 5])
# Finally do the assignment based on masking on a and b
In [57]: a[a!=0] = b[b!=0]
In [58]: a # Final verification on desired result
Out[58]:
array([[ 8, -4, 0, 5],
[ 0, 6, 0, 3],
[-4, 8, 0, 5]])
Solution 4:
Benchmarking for the vectorized solutions
We are looking to benchmark vectorized solutions in this post. Now, the vectorization tries to avoid the looping that we would loop through each row and do the shuffling. So, the setup for the input array involves a greater number of rows.
Approaches -
def app1(a): # @Daniel F's soln
i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]
return a
def app2(x): # @kazemakase's soln
r, c = np.where(x != 0)
n = c.size
perm = np.random.permutation(n)
i = np.argsort(perm + r * n)
x[r, c] = x[r, c[i]]
return x
def app3(a): # @Divakar's soln
m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1)
b = a[np.arange(m)[:,None], rand_idx]
a[a!=0] = b[b!=0]
return a
from scipy.ndimage.measurements import labeled_comprehension
def app4(a): # @FabienP's soln
def func(array, idx):
r[idx] = np.random.permutation(array)
return True
labels, idx = nz = a.nonzero()
r = a[nz]
labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1),\
func, int, 0, pass_positions=True)
a[nz] = r
return a
Benchmarking procedure #1
For a fair benchmarking, it seemed reasonable to use OP's sample and simply stack those as more rows to get a bigger dataset. Thus, with that setup we could create two cases with 2 million and 20 million rows datasets.
Case #1 : Large dataset with 2*1000,000
rows
In [174]: a = np.array([[2,3,1,0],[0,0,2,1]])
In [175]: a = np.vstack([a]*1000000)
In [176]: %timeit app1(a)
...: %timeit app2(a)
...: %timeit app3(a)
...: %timeit app4(a)
...:
1 loop, best of 3: 264 ms per loop
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 254 ms per loop
1 loop, best of 3: 14.3 s per loop
Case #2 : Larger dataset with 2*10,000,000
rows
In [177]: a = np.array([[2,3,1,0],[0,0,2,1]])
In [178]: a = np.vstack([a]*10000000)
# app4 skipped here as it was slower on the previous smaller dataset
In [179]: %timeit app1(a)
...: %timeit app2(a)
...: %timeit app3(a)
...:
1 loop, best of 3: 2.86 s per loop
1 loop, best of 3: 4.62 s per loop
1 loop, best of 3: 2.55 s per loop
Benchmarking procedure #2 : Extensive one
To cover all cases of varying percentage of non-zeros in the input array, we are covering some extensive benchmarking scenarios. Also, since app4
seemed much slower than others, for a closer inspection we are skipping that one in this section.
Helper function to setup input array :
def in_data(n_col, nnz_ratio):
# max no. of elems that my system can handle, i.e. stretching it to limits.
# The idea is to use this to decide the number of rows and always use
# max. possible dataset size
num_elem = 10000000
n_row = num_elem//n_col
a = np.zeros((n_row, n_col),dtype=int)
L = int(round(a.size*nnz_ratio))
a.ravel()[np.random.choice(a.size, L, replace=0)] = np.random.randint(1,6,L)
return a
Main timing and plotting script (Uses IPython magic functions. So, needs to be run opon copying and pasting onto IPython console) -
import matplotlib.pyplot as plt
# Setup input params
nnz_ratios = np.array([0.2, 0.4, 0.6, 0.8])
n_cols = np.array([4, 5, 8, 10, 15, 20, 25, 50])
init_arr1 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr2 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr3 = np.zeros((len(nnz_ratios), len(n_cols) ))
timings = {app1:init_arr1, app2:init_arr2, app3:init_arr3}
for i,nnz_ratio in enumerate(nnz_ratios):
for j,n_col in enumerate(n_cols):
a = in_data(n_col, nnz_ratio=nnz_ratio)
for func in timings:
res = %timeit -oq func(a)
timings[func][i,j] = res.best
print func.__name__, i, j, res.best
fig = plt.figure(1)
colors = ['b','k','r']
for i in range(len(nnz_ratios)):
ax = plt.subplot(2,2,i+1)
for f,func in enumerate(timings):
ax.plot(n_cols,
[time for time in timings[func][i]],
label=str(func.__name__), color=colors[f])
ax.set_xlabel('No. of cols')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
plt.title('Percentage non-zeros : '+str(int(100*nnz_ratios[i])) + '%')
plt.subplots_adjust(wspace=0.2, hspace=0.2)
Timings output -
Observations :
Approaches #1, #2 does
argsort
on the non-zero elements across the entire input array. As such, it performs better with lesser percentage of non-zeros.Approach #3 creates random numbers of the same shape as the input array and then gets
argsort
indices per row. Thus, for a given number of non-zeros in the input, the timings for it are more steep-ish than first two approaches.
Conclusion :
Approach #1 seems to be doing pretty well until 60% non-zero mark. For more non-zeros and if the row-lengths are small, approach #3 seems to be performing decently.
Solution 5:
I came up with that:
nz = a.nonzero() # Get nonzero indexes
a[nz] = np.random.permutation(a[nz]) # Shuffle nonzero values with mask
Which look simpler (and a little bit faster?) than other proposed solutions.
EDIT: new version that does not mix rows
labels, *idx = nz = a.nonzero() # get masks
a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i]) # permute values
for i in np.unique(labels)]) # for each label
Where the first array of a.nonzero()
(indexes of non zero values in axis0) is used as labels. This is the trick that does not mix rows.
Then np.random.permutation
is applied on a[a.nonzero()]
for each "label" independently.
Supposedly scipy.ndimage.measurements.labeled_comprehension
can be used here, by it seems to fail with np.random.permutation
.
And I finally saw that it looks a lot like what @randomir proposed. Anyway, it was just for the challenge of getting it to work.
EDIT2:
Finally got it working with scipy.ndimage.measurements.labeled_comprehension
def shuffle_rows(a):
def func(array, idx):
r[idx] = np.random.permutation(array)
return True
labels, *idx = nz = a.nonzero()
r = a[nz]
labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1), func, int, 0, pass_positions=True)
a[nz] = r
return a
Where:
func()
shuffles the non zero valueslabeled_comprehension
appliesfunc()
label-wise
This replaces the previous for loop and will be faster on arrays with many rows.
Post a Comment for "Shuffling Non-zero Elements Of Each Row In An Array - Python / NumPy"