Skip to content Skip to sidebar Skip to footer

Functional Programming: Numpy Vectorizable Function To Create An Expected Values Array

I want to run a chi-squared statistical test against some categorical data counts - but to do that, I need to calculate the expected values for each cell of an array matching my ob

Solution 1:

You can do this with numpy built-ins using broadcasting. Broadcasting allows you to add together two arrays of different shapes without making excessive copies or looping excessively.

We can solve your problem by creating two vectors representing the row and column sums respectively, and 'multiplying' them together, which will broadcast them into a correctly sized and shaped array.

The best introduction to this topic I know of is the talk Losing Your Loops: Fast Numerical Computation with Numpy by Jake Vanderplass. It contains visual examples that I find essential for wrapping your head around broadcasting.

Here's a simple example:

IN

import numpy as np
a = np.arange(3)
b = np.reshape(np.arange(3), [3, 1])
print('a = ', a)
print('b = ')
print(b)
print('a+b = ')
print(a+b)

OUT:

a = [012]
b =
[[0]
 [1]
 [2]]
a+b =
[[0 1 2]
 [1 2 3]
 [2 3 4]]

We can solve your problem by creating two vectors representing the row and column sums respectively 'multiplying' them together, broadcasting them into a correctly sized and shaped array.

import numpy as np
defgen_expected(array: np.ndarray):
    col_sums = (np.sum(array, axis=0))
    row_sums = np.sum(array, axis=1)
    np.reshape(row_sums, [len(row_sums), 1])
    return (col_sums * row_sums)  / np.sum(array)
# NOTE: this result might be transposed! Check it yourself!

Solution 2:

The function scipy.stats.chi2_contingency will compute the expected array for you. For example,

In [303]: from scipy.stats import chi2_contingency

In [304]: a = np.array([[3, 5, 10], [2, 4, 16]])

In [305]: chi2, p, dof, expected = chi2_contingency(a)

In [306]: expected
Out[306]: 
array([[  2.25,   4.05,  11.7 ],
       [  2.75,   4.95,  14.3 ]])

If you only want the expected array, you can use scipy.stats.contingency.expected_freq:

In [307]: from scipy.stats.contingency import expected_freq

In [308]: expected_freq(a)
Out[308]: 
array([[  2.25,   4.05,  11.7 ],
       [  2.75,   4.95,  14.3 ]])

To see how expected_freq computes the result, you can see the source code here: https://github.com/scipy/scipy/blob/master/scipy/stats/contingency.py

You'll see that the code is vectorized; the only explicit loop is over the number of dimensions of the input array (in the function margins(a)).

Solution 3:

So with a sample array:

In [147]: arr = np.arange(16).reshape(4,4)

and your code:

In [148]: def gen_expected(a_array):
     ...:     new_array = np.zeros(a_array.shape,dtype=float)
     ...:     for ri,r in enumerate(a_array):
     ...:         for ci,c in enumerate(a_array.T):
     ...:             new_array[ri,ci]=(c.sum()*r.sum()/a_array.sum())
     ...:     return new_array
     ...: 
In [149]: gen_expected(arr)
Out[149]: 
array([[  1.2       ,   1.4       ,   1.6       ,   1.8       ],
       [  4.4       ,   5.13333333,   5.86666667,   6.6       ],
       [  7.6       ,   8.86666667,  10.13333333,  11.4       ],
       [ 10.8       ,  12.6       ,  14.4       ,  16.2       ]])

The only element level, or scalar, step in your function is a 3 term product and division:

In [151]: defchi0(csum, rsum, asum):
     ...:     return csum*rsum/asum

which we can 'wrap' in vectorize:

In [152]: fchi0 = np.vectorize(chi0, otypes=[float])

vectorize does a nice job of broadcasting arrays against each other, and feeding the results to your function. In effect it can perform the row and column enumerations. We just need to take the relevant sums:

In [153]: fchi0(arr.sum(axis=1,keepdims=True), arr.sum(axis=0,keepdims=True), arr.sum())
Out[153]: 
array([[  1.2       ,   1.4       ,   1.6       ,   1.8       ],
       [  4.4       ,   5.13333333,   5.86666667,   6.6       ],
       [  7.6       ,   8.86666667,  10.13333333,  11.4       ],
       [ 10.8       ,  12.6       ,  14.4       ,  16.2       ]])

But with those 3 sums, I don't need to use the vectorize intermediary:

In [154]: arr.sum(axis=1,keepdims=True)* arr.sum(axis=0,keepdims=True) / arr.sum()

Out[154]: 
array([[  1.2       ,   1.4       ,   1.6       ,   1.8       ],
       [  4.4       ,   5.13333333,   5.86666667,   6.6       ],
       [  7.6       ,   8.86666667,  10.13333333,  11.4       ],
       [ 10.8       ,  12.6       ,  14.4       ,  16.2       ]])

In [155]: arr.sum(axis=1,keepdims=True), arr.sum(axis=0,keepdims=True), arr.sum()
Out[155]: 
(array([[ 6],
        [22],
        [38],
        [54]]), 
 array([[24, 28, 32, 36]]), 
 120)

The key to the broadcasting is creating a column vector, a row vector, and total sum (scalar). One is (4,1) in shape, the other (1,4). The keepdims parameter keeps the 2d shape. Without it we'd have to add a dimension to the first. (4,) broadcasts to (1,4), but to (4,1) requires explicit ok.

When generating indices, e.g. np.arange(4), np.ix_ is a handy tool

In [156]: np.ix_(arr.sum(axis=1), arr.sum(axis=0))
Out[156]: 
(array([[ 6],
        [22],
        [38],
        [54]]), array([[24, 28, 32, 36]]))

And before keepdims was added, adding dimensions with np.newaxis was - and still is - a favorite:

In [157]: arr.sum(axis=1)[:,None] * arr.sum(axis=0) / arr.sum()
Out[157]: 
array([[  1.2       ,   1.4       ,   1.6       ,   1.8       ],
       [  4.4       ,   5.13333333,   5.86666667,   6.6       ],
       [  7.6       ,   8.86666667,  10.13333333,  11.4       ],
       [ 10.8       ,  12.6       ,  14.4       ,  16.2       ]])

More generally ix_ is recommended when generating indices from ranges;

In [160]: I,J = np.ix_(range(10,40,10), range(1,5))
In [161]: I+J
Out[161]: 
array([[11, 12, 13, 14],
       [21, 22, 23, 24],
       [31, 32, 33, 34]])

Post a Comment for "Functional Programming: Numpy Vectorizable Function To Create An Expected Values Array"