Why Is B = numpy.dot(A,x) So Much Slower Than Looping Through, Doing B[i,:,:] = numpy.dot(A[i,:,:],x)?
Solution 1:
With smaller dims 10,100,200, I get a similar ranking:
In [355]: %%timeit
   .....: B=np.zeros((N,M,L))
   .....: for i in range(N):
   .....:     B[i,:,:]=np.dot(A[i,:,:],x)
   .....:
10 loops, best of 3: 22.5 ms per loop
In [356]: timeit np.dot(A,x)
10 loops, best of 3: 44.2 ms per loop
In [357]: timeit np.einsum('ijk,km->ijm',A,x)
10 loops, best of 3: 29 ms per loop
In [367]: timeit np.dot(A.reshape(-1,M),x).reshape(N,M,L)
10 loops, best of 3: 22.1 ms per loop
In [375]: timeit np.tensordot(A,x,(2,0))
10 loops, best of 3: 22.2 ms per loop
The iterative approach is faster, though not by as much as in your case.
This is probably true as long as the iterating dimension is small compared to the other ones. In that case the overhead of iteration (function calls etc.) is small compared to the calculation time, and doing all the values at once uses more memory.
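As a rough check of that claim, here is a minimal timing sketch (a hypothetical setup, with shapes modeled on the dims 10,100,200 example) that compares the looped dot with the all-at-once dot as the iterating dimension N grows:

import numpy as np
import timeit

# Sketch: compare looped dot vs. whole-array dot while growing the
# iterating dimension N; M and L are assumptions from the example above.
M, L = 100, 200
for N in (1, 10, 100):
    A = np.random.rand(N, M, M)
    x = np.random.rand(M, L)

    def looped(A=A, x=x, N=N):
        B = np.zeros((N, M, L))
        for i in range(N):
            B[i,:,:] = np.dot(A[i,:,:], x)
        return B

    t_loop = timeit.timeit(looped, number=10)
    t_full = timeit.timeit(lambda A=A, x=x: np.dot(A, x), number=10)
    print("N=%4d  loop: %.4fs  full dot: %.4fs" % (N, t_loop, t_full))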
I tried a dot variation where I reshaped A into 2d, thinking that dot does that kind of reshaping internally. I'm surprised that it is actually fastest. tensordot is probably doing the same reshaping (its code is Python, so you can read it).
einsum sets up a 'sum of products' iteration involving 4 variables, i,j,k,m - that is dim1*dim2*dim2*dim3 steps with the C-level nditer. So the more indices you have, the larger the iteration space.
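As a sanity check, a small sketch (same assumed shapes as the timings above) confirming that all these variants compute the same result:

import numpy as np

# Sketch: all the variants timed above should agree numerically.
# Shapes follow the dims 10,100,200 example: A is (N,M,M), x is (M,L).
N, M, L = 10, 100, 200
A = np.random.rand(N, M, M)
x = np.random.rand(M, L)

ref = np.array([np.dot(A[i], x) for i in range(N)])  # the explicit loop
assert np.allclose(ref, np.dot(A, x))
assert np.allclose(ref, np.einsum('ijk,km->ijm', A, x))
assert np.allclose(ref, np.dot(A.reshape(-1, M), x).reshape(N, M, L))
assert np.allclose(ref, np.tensordot(A, x, (2, 0)))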
Solution 2:
numpy.dot only delegates to a BLAS matrix multiply when the inputs each have dimension at most 2:
#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif
When you stick your whole 3-dimensional A array into dot, NumPy takes a slower path, going through an nditer object. It still tries to get some use out of BLAS in the slow path, but the way the slow path is designed, it can only use a vector-vector multiply rather than a matrix-matrix multiply, which doesn't give the BLAS anywhere near as much room to optimize.
Solution 3:
I am not too familiar with numpy's C-API, and numpy.dot is one such builtin function that used to live under _dotblas in earlier versions. Nevertheless, here are my thoughts.
1) numpy.dot takes different paths for 2-dimensional arrays and n-dimensional arrays. From numpy.dot's online documentation:

For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b:

dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
So for 2-D arrays you are always guaranteed one call to BLAS's dgemm; however, for N-D arrays numpy might choose multiplication axes that do not correspond to the fastest-changing axis (as you can see from the excerpt I have posted), and as a result the full power of dgemm could be missed out on.
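A small sketch (shapes chosen arbitrarily for the test) verifying the documented N-D sum-product rule against an equivalent einsum:

import numpy as np

# Sketch: check dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m]) directly.
a = np.random.rand(2, 3, 4)
b = np.random.rand(5, 4, 6)

out = np.dot(a, b)  # contracts last axis of a with second-to-last of b
assert out.shape == (2, 3, 5, 6)
assert np.allclose(out, np.einsum('ijl,klm->ijkm', a, b))
assert np.isclose(out[1, 2, 3, 4], np.sum(a[1, 2, :] * b[3, :, 4]))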
2) Your A array is too large to fit in the CPU cache. In your example, you use A with dimensions (10,1000,1000), which gives

In [1]: A.nbytes
80000000

In [2]: 80000000/1024
78125

That is almost 80 MB, much larger than your cache size. So again you lose most of dgemm's power right there.
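For reference, that working-set arithmetic spelled out as a sketch; the cache size in the comment is an illustrative assumption, not a measured value:

import numpy as np

# Sketch: the array from the question cannot stay resident in cache.
A = np.zeros((10, 1000, 1000))  # float64
print(A.nbytes)         # 80000000 bytes
print(A.nbytes / 1024)  # 78125 KB, i.e. roughly 76 MiB
# A typical L3 cache is on the order of 8-32 MB (an assumption), so A
# is streamed from main memory rather than reused from cache.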
3) You are also timing the functions somewhat imprecisely. The time function in Python is known to be imprecise; use timeit instead.
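For instance, a minimal sketch with the timeit module, which repeats the call enough times to resolve microsecond-scale costs:

import timeit
import numpy as np

# Sketch: time np.dot with timeit instead of time.time(), whose
# resolution is too coarse for sub-millisecond calls.
A = np.random.rand(20, 20, 20)
x = np.random.rand(20, 20)

t = timeit.timeit(lambda: np.dot(A, x), number=10000)
print("%.1f microseconds per call" % (t / 10000 * 1e6))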
So having all the above points in mind, let's try experimenting with arrays that can fit in the cache:
dim1, dim2, dim3 = 20, 20, 20
A = np.random.rand(dim1, dim2, dim2)
x = np.random.rand(dim2, dim3)
def for_dot1(A, x):
    for i in range(A.shape[0]):
        np.dot(A[i,:,:], x)

def for_dot2(A, x):
    for i in range(A.shape[0]):
        np.dot(A[:,i,:], x)

def for_dot3(A, x):
    for i in range(A.shape[0]):
        np.dot(A[:,:,i], x)
and here are the timings that I get (using numpy 1.9.2 built against OpenBLAS 0.2.14):
In [3]: %timeit np.dot(A,x)
10000 loops, best of 3: 174 µs per loop
In [4]: %timeit np.einsum("ijk, kl -> ijl", A, x)
10000 loops, best of 3: 108 µs per loop
In [5]: %timeit np.einsum("ijk, lk -> ijl", A, x)
10000 loops, best of 3: 97.1 µs per loop
In [6]: %timeit np.einsum("ikj, kl -> ijl", A, x)
1000 loops, best of 3: 238 µs per loop
In [7]: %timeit np.einsum("kij, kl -> ijl", A, x)
10000 loops, best of 3: 113 µs per loop
In [8]: %timeit for_dot1(A,x)
10000 loops, best of 3: 101 µs per loop
In [9]: %timeit for_dot2(A,x)
10000 loops, best of 3: 131 µs per loop
In [10]: %timeit for_dot3(A,x)
10000 loops, best of 3: 133 µs per loop
Notice that there is still a time difference, but not by orders of magnitude. Also note the importance of choosing the axis of multiplication; a sketch of why that matters follows below. Now perhaps a numpy developer can shed some light on what numpy.dot actually does under the hood for N-D arrays.
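A small sketch of why the axis choice matters, assuming it comes down to memory layout: slicing a C-contiguous array along its first axis yields contiguous 2-D blocks, while slicing along the last axis yields strided ones that BLAS cannot use directly:

import numpy as np

# Sketch: contiguity of the 2-D slices fed to np.dot in the loops above.
A = np.random.rand(20, 20, 20)
print(A[0,:,:].flags['C_CONTIGUOUS'])  # True: for_dot1's slices are contiguous
print(A[:,:,0].flags['C_CONTIGUOUS'])  # False: for_dot3's slices are strided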