Functional Programming: Numpy Vectorizable Function To Create An Expected Values Array
Solution 1:
You can do this with numpy built-ins using broadcasting. Broadcasting allows you to add together two arrays of different shapes without making excessive copies or looping excessively.
We can solve your problem by creating two vectors representing the row and column sums respectively, and 'multiplying' them together, which will broadcast them into a correctly sized and shaped array.
The best introduction to this topic I know of is the talk Losing Your Loops: Fast Numerical Computation with Numpy by Jake Vanderplass. It contains visual examples that I find essential for wrapping your head around broadcasting.
Here's a simple example:
IN
import numpy as np
a = np.arange(3)
b = np.reshape(np.arange(3), [3, 1])
print('a = ', a)
print('b = ')
print(b)
print('a+b = ')
print(a+b)
OUT:
a = [012]
b =
[[0]
[1]
[2]]
a+b =
[[0 1 2]
[1 2 3]
[2 3 4]]
We can solve your problem by creating two vectors representing the row and column sums respectively 'multiplying' them together, broadcasting them into a correctly sized and shaped array.
import numpy as np
defgen_expected(array: np.ndarray):
col_sums = (np.sum(array, axis=0))
row_sums = np.sum(array, axis=1)
np.reshape(row_sums, [len(row_sums), 1])
return (col_sums * row_sums) / np.sum(array)
# NOTE: this result might be transposed! Check it yourself!
Solution 2:
The function scipy.stats.chi2_contingency
will compute the expected
array for you. For example,
In [303]: from scipy.stats import chi2_contingency
In [304]: a = np.array([[3, 5, 10], [2, 4, 16]])
In [305]: chi2, p, dof, expected = chi2_contingency(a)
In [306]: expected
Out[306]:
array([[ 2.25, 4.05, 11.7 ],
[ 2.75, 4.95, 14.3 ]])
If you only want the expected
array, you can use scipy.stats.contingency.expected_freq
:
In [307]: from scipy.stats.contingency import expected_freq
In [308]: expected_freq(a)
Out[308]:
array([[ 2.25, 4.05, 11.7 ],
[ 2.75, 4.95, 14.3 ]])
To see how expected_freq
computes the result, you can see the source code here: https://github.com/scipy/scipy/blob/master/scipy/stats/contingency.py
You'll see that the code is vectorized; the only explicit loop is over the number of dimensions of the input array (in the function margins(a)
).
Solution 3:
So with a sample array:
In [147]: arr = np.arange(16).reshape(4,4)
and your code:
In [148]: def gen_expected(a_array):
...: new_array = np.zeros(a_array.shape,dtype=float)
...: for ri,r in enumerate(a_array):
...: for ci,c in enumerate(a_array.T):
...: new_array[ri,ci]=(c.sum()*r.sum()/a_array.sum())
...: return new_array
...:
In [149]: gen_expected(arr)
Out[149]:
array([[ 1.2 , 1.4 , 1.6 , 1.8 ],
[ 4.4 , 5.13333333, 5.86666667, 6.6 ],
[ 7.6 , 8.86666667, 10.13333333, 11.4 ],
[ 10.8 , 12.6 , 14.4 , 16.2 ]])
The only element level, or scalar, step in your function is a 3 term product and division:
In [151]: defchi0(csum, rsum, asum):
...: return csum*rsum/asum
which we can 'wrap' in vectorize:
In [152]: fchi0 = np.vectorize(chi0, otypes=[float])
vectorize
does a nice job of broadcasting arrays against each other, and feeding the results to your function. In effect it can perform the row and column enumerations. We just need to take the relevant sums:
In [153]: fchi0(arr.sum(axis=1,keepdims=True), arr.sum(axis=0,keepdims=True), arr.sum())
Out[153]:
array([[ 1.2 , 1.4 , 1.6 , 1.8 ],
[ 4.4 , 5.13333333, 5.86666667, 6.6 ],
[ 7.6 , 8.86666667, 10.13333333, 11.4 ],
[ 10.8 , 12.6 , 14.4 , 16.2 ]])
But with those 3 sums, I don't need to use the vectorize
intermediary:
In [154]: arr.sum(axis=1,keepdims=True)* arr.sum(axis=0,keepdims=True) / arr.sum()
Out[154]:
array([[ 1.2 , 1.4 , 1.6 , 1.8 ],
[ 4.4 , 5.13333333, 5.86666667, 6.6 ],
[ 7.6 , 8.86666667, 10.13333333, 11.4 ],
[ 10.8 , 12.6 , 14.4 , 16.2 ]])
In [155]: arr.sum(axis=1,keepdims=True), arr.sum(axis=0,keepdims=True), arr.sum()
Out[155]:
(array([[ 6],
[22],
[38],
[54]]),
array([[24, 28, 32, 36]]),
120)
The key to the broadcasting is creating a column vector, a row vector, and total sum (scalar). One is (4,1) in shape, the other (1,4). The keepdims
parameter keeps the 2d shape. Without it we'd have to add a dimension to the first. (4,) broadcasts to (1,4), but to (4,1) requires explicit ok.
When generating indices, e.g. np.arange(4)
, np.ix_
is a handy tool
In [156]: np.ix_(arr.sum(axis=1), arr.sum(axis=0))
Out[156]:
(array([[ 6],
[22],
[38],
[54]]), array([[24, 28, 32, 36]]))
And before keepdims
was added, adding dimensions with np.newaxis
was - and still is - a favorite:
In [157]: arr.sum(axis=1)[:,None] * arr.sum(axis=0) / arr.sum()
Out[157]:
array([[ 1.2 , 1.4 , 1.6 , 1.8 ],
[ 4.4 , 5.13333333, 5.86666667, 6.6 ],
[ 7.6 , 8.86666667, 10.13333333, 11.4 ],
[ 10.8 , 12.6 , 14.4 , 16.2 ]])
More generally ix_
is recommended when generating indices from ranges;
In [160]: I,J = np.ix_(range(10,40,10), range(1,5))
In [161]: I+J
Out[161]:
array([[11, 12, 13, 14],
[21, 22, 23, 24],
[31, 32, 33, 34]])
Post a Comment for "Functional Programming: Numpy Vectorizable Function To Create An Expected Values Array"