Why Is B = numpy.dot(A,x) So Much Slower Than Looping Through, Doing B[i,:,:] = numpy.dot(A[i,:,:],x)?
Solution 1:
With smaller dims 10,100,200, I get a similar ranking:
In [355]: %%timeit
   .....: B=np.zeros((N,M,L))
   .....: for i in range(N):
   .....:     B[i,:,:]=np.dot(A[i,:,:],x)
   .....:
10 loops, best of 3: 22.5 ms per loop
In [356]: timeit np.dot(A,x)
10 loops, best of 3: 44.2 ms per loop
In [357]: timeit np.einsum('ijk,km->ijm',A,x)
10 loops, best of 3: 29 ms per loop
In [367]: timeit np.dot(A.reshape(-1,M),x).reshape(N,M,L)
10 loops, best of 3: 22.1 ms per loop
In [375]: timeit np.tensordot(A,x,(2,0))
10 loops, best of 3: 22.2 ms per loop
The iterative approach is faster, though not by as much as in your case.
This is probably true as long as the iterating dimension is small compared to the other ones. In that case the overhead of iteration (function calls etc.) is small compared to the calculation time, and doing all the values at once uses more memory.
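As a rough check of that claim, here is a minimal timing sketch (a hypothetical setup, with shapes modeled on the dims 10,100,200 example) that compares the looped dot with the all-at-once dot as the iterating dimension N grows:

import numpy as np
import timeit

# Sketch: compare looped dot vs. whole-array dot while growing the
# iterating dimension N; M and L are assumptions from the example above.
M, L = 100, 200
for N in (1, 10, 100):
    A = np.random.rand(N, M, M)
    x = np.random.rand(M, L)

    def looped(A=A, x=x, N=N):
        B = np.zeros((N, M, L))
        for i in range(N):
            B[i,:,:] = np.dot(A[i,:,:], x)
        return B

    t_loop = timeit.timeit(looped, number=10)
    t_full = timeit.timeit(lambda A=A, x=x: np.dot(A, x), number=10)
    print("N=%4d  loop: %.4fs  full dot: %.4fs" % (N, t_loop, t_full))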
I tried a dot variation where I reshaped A into 2d, thinking that dot does that kind of reshaping internally. I'm surprised that it is actually fastest. tensordot is probably doing the same reshaping (its code is Python, so you can read it).
einsum sets up a 'sum of products' iteration involving 4 variables, i,j,k,m - that is dim1*dim2*dim2*dim3 steps with the C-level nditer. So the more indices you have, the larger the iteration space.
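As a sanity check, a small sketch (same assumed shapes as the timings above) confirming that all these variants compute the same result:

import numpy as np

# Sketch: all the variants timed above should agree numerically.
# Shapes follow the dims 10,100,200 example: A is (N,M,M), x is (M,L).
N, M, L = 10, 100, 200
A = np.random.rand(N, M, M)
x = np.random.rand(M, L)

ref = np.array([np.dot(A[i], x) for i in range(N)])  # the explicit loop
assert np.allclose(ref, np.dot(A, x))
assert np.allclose(ref, np.einsum('ijk,km->ijm', A, x))
assert np.allclose(ref, np.dot(A.reshape(-1, M), x).reshape(N, M, L))
assert np.allclose(ref, np.tensordot(A, x, (2, 0)))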
Solution 2:
numpy.dot only delegates to a BLAS matrix multiply when the inputs each have dimension at most 2:
#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif
When you stick your whole 3-dimensional A array into dot, NumPy takes a slower path, going through an nditer object. It still tries to get some use out of BLAS in the slow path, but the way the slow path is designed, it can only use a vector-vector multiply rather than a matrix-matrix multiply, which doesn't give the BLAS anywhere near as much room to optimize.
Solution 3:
I am not too familiar with numpy's C-API, and numpy.dot is one such builtin function that used to live under _dotblas in earlier versions. Nevertheless, here are my thoughts.
1) numpy.dot takes different paths for 2-dimensional arrays and n-dimensional arrays. From numpy.dot's online documentation:

For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b:

dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
So for 2-D arrays you are always guaranteed one call to BLAS's dgemm; however, for N-D arrays numpy might choose multiplication axes that do not correspond to the fastest-changing axis (as you can see from the excerpt I have posted), and as a result the full power of dgemm could be missed out on.
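A small sketch (shapes chosen arbitrarily for the test) verifying the documented N-D sum-product rule against an equivalent einsum:

import numpy as np

# Sketch: check dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m]) directly.
a = np.random.rand(2, 3, 4)
b = np.random.rand(5, 4, 6)

out = np.dot(a, b)  # contracts last axis of a with second-to-last of b
assert out.shape == (2, 3, 5, 6)
assert np.allclose(out, np.einsum('ijl,klm->ijkm', a, b))
assert np.isclose(out[1, 2, 3, 4], np.sum(a[1, 2, :] * b[3, :, 4]))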
2) Your A array is too large to fit in the CPU cache. In your example, you use A with dimensions (10,1000,1000), which gives

In [1]: A.nbytes
80000000

In [2]: 80000000/1024
78125

That is almost 80 MB, much larger than your cache size. So again you lose most of dgemm's power right there.
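For reference, that working-set arithmetic spelled out as a sketch; the cache size in the comment is an illustrative assumption, not a measured value:

import numpy as np

# Sketch: the array from the question cannot stay resident in cache.
A = np.zeros((10, 1000, 1000))  # float64
print(A.nbytes)         # 80000000 bytes
print(A.nbytes / 1024)  # 78125 KB, i.e. roughly 76 MiB
# A typical L3 cache is on the order of 8-32 MB (an assumption), so A
# is streamed from main memory rather than reused from cache.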
3) You are also timing the functions somewhat imprecisely. The time function in Python is known to be imprecise; use timeit instead.
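For instance, a minimal sketch with the timeit module, which repeats the call enough times to resolve microsecond-scale costs:

import timeit
import numpy as np

# Sketch: time np.dot with timeit instead of time.time(), whose
# resolution is too coarse for sub-millisecond calls.
A = np.random.rand(20, 20, 20)
x = np.random.rand(20, 20)

t = timeit.timeit(lambda: np.dot(A, x), number=10000)
print("%.1f microseconds per call" % (t / 10000 * 1e6))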
So having all the above points in mind, let's try experimenting with arrays that can fit in the cache:
dim1, dim2, dim3 = 20, 20, 20
A = np.random.rand(dim1, dim2, dim2)
x = np.random.rand(dim2, dim3)
def for_dot1(A, x):
    for i in range(A.shape[0]):
        np.dot(A[i,:,:], x)

def for_dot2(A, x):
    for i in range(A.shape[0]):
        np.dot(A[:,i,:], x)

def for_dot3(A, x):
    for i in range(A.shape[0]):
        np.dot(A[:,:,i], x)
and here are the timings that I get (using numpy 1.9.2 built against OpenBLAS 0.2.14):
In [3]: %timeit np.dot(A,x)
10000 loops, best of 3: 174 µs per loop
In [4]: %timeit np.einsum("ijk, kl -> ijl", A, x)
10000 loops, best of 3: 108 µs per loop
In [5]: %timeit np.einsum("ijk, lk -> ijl", A, x)
10000 loops, best of 3: 97.1 µs per loop
In [6]: %timeit np.einsum("ikj, kl -> ijl", A, x)
1000 loops, best of 3: 238 µs per loop
In [7]: %timeit np.einsum("kij, kl -> ijl", A, x)
10000 loops, best of 3: 113 µs per loop
In [8]: %timeit for_dot1(A,x)
10000 loops, best of 3: 101 µs per loop
In [9]: %timeit for_dot2(A,x)
10000 loops, best of 3: 131 µs per loop
In [10]: %timeit for_dot3(A,x)
10000 loops, best of 3: 133 µs per loop
Notice that there is still a time difference, but not by orders of magnitude. Also note the importance of choosing the axis of multiplication; a sketch of why that matters follows below. Now perhaps a numpy developer can shed some light on what numpy.dot actually does under the hood for N-D arrays.
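A small sketch of why the axis choice matters, assuming it comes down to memory layout: slicing a C-contiguous array along its first axis yields contiguous 2-D blocks, while slicing along the last axis yields strided ones that BLAS cannot use directly:

import numpy as np

# Sketch: contiguity of the 2-D slices fed to np.dot in the loops above.
A = np.random.rand(20, 20, 20)
print(A[0,:,:].flags['C_CONTIGUOUS'])  # True: for_dot1's slices are contiguous
print(A[:,:,0].flags['C_CONTIGUOUS'])  # False: for_dot3's slices are strided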