
Why Is Get_group So Slow In Pandas?

I have a CSV file with 400,000 rows and 15 columns. I have to perform multiple filter operations for each row, so I thought of using pandas and groupby to try to improve the performance...

Solution 1:

You would have to show your data to demonstrate this; get_group is quite fast. The first iteration DOES do some caching, but the cost is minimal (sorting of the data is irrelevant):

In [1]: import numpy as np

In [2]: from pandas import DataFrame

In [3]: N = 1000000

In [4]: df = DataFrame(dict(A = np.random.randint(0,1000,size=N),B=np.random.randint(0,1000,size=N),C=np.random.randn(N)))

In [5]: %timeit df.groupby(['A','B'])
10000 loops, best of 3: 84.2 µs per loop

In [6]: g = df.groupby(['A','B'])

In [7]: %timeit -n 1 g.get_group((100,100))
1 loops, best of 3: 2.86 ms per loop
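The caching point above can be checked with a rough sketch like the following, which times two consecutive get_group calls on the same key. The smaller N, the seeded generator, and the helper timed_get_group are assumptions made for illustration; absolute timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

N = 100_000  # assumption: smaller than the answer's N to keep the sketch quick
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 100, size=N),
    "B": rng.integers(0, 100, size=N),
    "C": rng.standard_normal(N),
})

g = df.groupby(["A", "B"])
# Use the key of the first row so the group is guaranteed to exist.
key = (int(df["A"].iloc[0]), int(df["B"].iloc[0]))


def timed_get_group(grouped, k):
    """Hypothetical helper: return (elapsed_seconds, group) for one call."""
    start = time.perf_counter()
    out = grouped.get_group(k)
    return time.perf_counter() - start, out


first_time, first = timed_get_group(g, key)    # may include one-off setup work
second_time, second = timed_get_group(g, key)  # reuses whatever was cached
print(f"first call:  {first_time * 1e3:.2f} ms")
print(f"second call: {second_time * 1e3:.2f} ms")
```

If the first call is noticeably slower than the second, the difference is the one-off setup cost rather than the cost of get_group itself.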

Further, you should not call get_group repeatedly. Instead, use the cythonized functions, apply, or iteration; see the pandas groupby docs.
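As a minimal sketch of that advice, the snippet below computes a per-group mean three ways: by calling get_group once per key, with the built-in mean aggregation, and by iterating over the groups. The column names follow the answer's example; the data size, seed, and the choice of mean as the operation are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 10_000  # assumption: small enough for a quick demonstration
df = pd.DataFrame({
    "A": rng.integers(0, 10, size=N),
    "B": rng.integers(0, 10, size=N),
    "C": rng.standard_normal(N),
})
g = df.groupby(["A", "B"])

# Pattern to avoid: one get_group call per key.
slow = {key: g.get_group(key)["C"].mean() for key in g.groups}

# Built-in aggregation: every group is handled in a single call.
fast = g["C"].mean()

# Plain iteration also visits each group exactly once.
by_iter = {key: grp["C"].mean() for key, grp in g}
```

All three produce the same per-group values; the built-in aggregation returns them as a Series indexed by (A, B), so a single result can be looked up with fast.loc[(a, b)].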

Solution 2:

Instead of using get_group(), you should use filtering (like df[(df.Year == '2014') & (df.Team == 'Barcelona')]). This is extremely fast and performs the same operation. Here is a detailed comparison of the two.

In [1]: df = DataFrame(dict(A = np.random.randint(0,1000,size=N),B=np.random.randint(0,1000,size=N),C=np.random.randn(N)))

In [2]: %time df.groupby(['A','B'])
CPU times: user 0 ns, sys: 804 µs, total: 804 µs
Wall time: 802 µs

In [3]: g = df.groupby(['A','B'])

In [4]: %time g.get_group((100,100))
CPU times: user 1.47 s, sys: 93.8 ms, total: 1.56 s
Wall time: 1.57 s
          A    B         C
325601  100  100  1.547365
837535  100  100 -0.058478

In [5]: %time df[(df.A == 100) & (df.B == 100)]
CPU times: user 12.6 ms, sys: 317 µs, total: 12.9 ms
Wall time: 21.3 ms
          A    B         C
325601  100  100  1.547365
837535  100  100 -0.058478

On these single timed calls, this is a speedup of more than 70x (1.57 s versus 21.3 ms). Moreover, filtering, not groupby, is the right way to access rows by column value!
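For a self-contained version of the comparison above, the following sketch selects the same rows with a boolean mask and with get_group and checks that they agree. The data size and seed are assumptions, and the key is taken from the first row so that the group is guaranteed to exist.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 100_000  # assumption: smaller than in the timings above
df = pd.DataFrame({
    "A": rng.integers(0, 100, size=N),
    "B": rng.integers(0, 100, size=N),
    "C": rng.standard_normal(N),
})

# Pick a key that certainly occurs in the data.
a, b = int(df["A"].iloc[0]), int(df["B"].iloc[0])

# Boolean-mask filtering: build one mask and index the frame with it.
filtered = df[(df["A"] == a) & (df["B"] == b)]

# The groupby route to the same rows.
grouped = df.groupby(["A", "B"]).get_group((a, b))

# Compare the selected values from both routes.
same_values = filtered["C"].sort_index().equals(grouped["C"].sort_index())
print(same_values)
```

The mask approach needs no grouping step at all, which is why it suits one-off lookups; groupby pays off when every group has to be processed.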

Solution 3:

Instead of using the get_group method, i.e.:

grouped = df.groupby("the_column_you_want")
grouped.get_group("the_group_you_want")

You can use:

grouped = df.groupby("the_column_you_want")
for name, group in grouped:
    if name == "the_group_you_want":
        print(group)

It is equivalent to the get_group function but will compute a lot faster.
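A runnable sketch of the loop above, on a tiny made-up frame, shows that it selects the same rows as get_group. The column names team and goals and the data are hypothetical, and the early break is an addition that stops the scan once the group is found; whether the loop is actually faster should be measured on your own data.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "team": ["Barcelona", "Madrid", "Barcelona", "Sevilla"],
    "goals": [3, 1, 2, 0],
})
grouped = df.groupby("team")

wanted = "Barcelona"
found = None
for name, group in grouped:
    if name == wanted:
        found = group
        break  # stop scanning once the wanted group is reached

via_get_group = grouped.get_group(wanted)
print(found)
```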
