Why Is get_group So Slow in Pandas?
Solution 1:
You would have to show data to prove this; get_group is quite fast. The first call does some caching, but it is minimal (sorting of the data is irrelevant):
N = 1000000
In [4]: df = DataFrame(dict(A=np.random.randint(0, 1000, size=N),
   ...:                     B=np.random.randint(0, 1000, size=N),
   ...:                     C=np.random.randn(N)))
In [5]: %timeit df.groupby(['A','B'])
10000 loops, best of 3: 84.2 µs per loop
In [6]: g = df.groupby(['A','B'])
In [7]: %timeit -n 1 g.get_group((100,100))
1 loops, best of 3: 2.86 ms per loop
Further, you should not call get_group repeatedly; instead, use the cythonized functions, apply, or iteration over the groups (see the docs here).
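As a sketch of those alternatives (using a smaller N than the session above, and an illustrative apply callable, so the full iteration finishes quickly):

import numpy as np
import pandas as pd

N = 100_000
df = pd.DataFrame({
    "A": np.random.randint(0, 100, size=N),
    "B": np.random.randint(0, 100, size=N),
    "C": np.random.randn(N),
})
g = df.groupby(["A", "B"])

# Cythonized aggregation: one pass computes the statistic for every group.
means = g["C"].mean()

# apply: run an arbitrary (here illustrative) function once per group.
spreads = g["C"].apply(lambda s: s.max() - s.min())

# Iteration: visit each (name, group) pair exactly once,
# instead of calling get_group once per key.
for name, group in g:
    pass  # process each subframe here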
Solution 2:
Instead of using get_group(), you should use filtering (e.g. df[(df.Year == '2014') & (df.Team == 'Barcelona')]). This is extremely fast and performs the same operation. Here's a detailed comparison of the two.
N = 1000000

In [1]: df = DataFrame(dict(A=np.random.randint(0, 1000, size=N),
   ...:                     B=np.random.randint(0, 1000, size=N),
   ...:                     C=np.random.randn(N)))
In [2]: %time df.groupby(['A','B'])
CPU times: user 0 ns, sys: 804 µs, total: 804 µs
Wall time: 802 µs
In [3]: g = df.groupby(['A','B'])
In [4]: %time g.get_group((100,100))
CPU times: user 1.47 s, sys: 93.8 ms, total: 1.56 s
Wall time: 1.57 s
             A    B         C
325601     100  100  1.547365
837535     100  100 -0.058478
In [5]: %time df[(df.A == 100) & (df.B == 100)]
CPU times: user 12.6 ms, sys: 317 µs, total: 12.9 ms
Wall time: 21.3 ms
             A    B         C
325601     100  100  1.547365
837535     100  100 -0.058478
That is a speedup of more than 70x (1.57 s vs 21.3 ms). Moreover, filtering is the right way to access rows by column value, not groupby!
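A self-contained sketch of the same comparison, with the imports and N that the session above leaves implicit (timings will vary by machine):

import time
import numpy as np
import pandas as pd

N = 1_000_000
df = pd.DataFrame({
    "A": np.random.randint(0, 1000, size=N),
    "B": np.random.randint(0, 1000, size=N),
    "C": np.random.randn(N),
})

# get_group: the groupby object builds its full group index on first use.
g = df.groupby(["A", "B"])
t0 = time.perf_counter()
via_group = g.get_group((100, 100))
t1 = time.perf_counter()

# Boolean filtering: one vectorized mask over the two columns.
t2 = time.perf_counter()
via_mask = df[(df.A == 100) & (df.B == 100)]
t3 = time.perf_counter()

print(f"get_group: {t1 - t0:.4f} s  filter: {t3 - t2:.4f} s")
assert via_group.equals(via_mask)  # both select the same rows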
Solution 3:
Instead of using the get_group method, i.e.:
grouped = df.groupby("the_column_you_want")
grouped.get_group("the_group_you_want")
You can use:

grouped = df.groupby("the_column_you_want")
for name, group in grouped:
    if name == "the_group_you_want":
        print(group)
It is equivalent to the get_group call but can compute a lot faster.
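And if you need several groups rather than just one, the same single pass can collect them all; a minimal sketch, assuming hypothetical group names:

grouped = df.groupby("the_column_you_want")

# One iteration over the groups, instead of re-running
# get_group (or this loop) once per name you care about.
wanted = {"group_1", "group_2"}  # hypothetical group names
found = {name: group for name, group in grouped if name in wanted}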