
Remove Level And All Of Its Rows From Pandas Dataframe If One Row Meets Condition

Below is a pandas DataFrame that I would like to filter. I would like to remove a year and all of its rows when the temp for at least one row (i.e., visit) in that year is < 37.

Solution 1:

You could use groupby/filter to remove groups based on a condition:

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
columns = pd.MultiIndex.from_product([['hr', 'temp']], names=['metric'])
data = pd.DataFrame([[96, 38], [98, 38], [85, 36], [84, 43]], index=index, columns=columns)

print(data.groupby(level='yr').filter(lambda x: (x['temp']>=37).all()))

yields

metric      hr temp
yr   visit         
2013 1      96   38
     2      98   38

Since the rows you wish to remove are grouped by yr, and yr is a level of the index, use groupby(level='yr'). For each group, the lambda function is called with x set to that group's sub-DataFrame. The group is kept when (x['temp']>=37).all() is True, that is, when every visit in that year has a temp of at least 37.
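To make the per-group decision visible, here is a minimal sketch that records what the filter function returns for each year before applying it. It assumes flat column labels ('hr', 'temp') rather than the one-level MultiIndex columns used above, so that group['temp'] is a plain Series; the names all_visits_ok, decisions and kept are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
# Assumption: flat columns instead of the MultiIndex columns in the answer.
data = pd.DataFrame({'hr': [96, 98, 85, 84], 'temp': [38, 38, 36, 43]}, index=index)

def all_visits_ok(group):
    # True only when every visit in this year's sub-DataFrame has temp >= 37.
    return bool((group['temp'] >= 37).all())

# One True/False decision per year: this is what filter() acts on.
decisions = {int(yr): all_visits_ok(group) for yr, group in data.groupby(level='yr')}
print(decisions)

# filter() keeps the rows of every group whose decision is True.
kept = data.groupby(level='yr').filter(all_visits_ok)
print(kept)
```

Here 2014 is dropped as a whole because its first visit has a temp of 36, even though its second visit (43) passes on its own.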


Note that Wen's suggestion,

data.loc[(data['temp']>=37).groupby(level='yr').transform('all')]

is faster, particularly for large DataFrames, since data['temp']>=37 computes the criterion in a vectorized way for the entire column whereas in my solution above, (x['temp']>=37).all() computes the criterion in a piecemeal fashion for each sub-DataFrame separately. Generally, vectorized solutions are faster when applied to large arrays or NDFrames instead of in a loop on smaller pieces.
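The vectorized approach can be broken into two steps to see what transform('all') contributes. This is only a sketch, again assuming flat column labels so that data['temp'] is a Series; row_ok and year_ok are hypothetical names.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
# Assumption: flat columns instead of the MultiIndex columns in the answer.
data = pd.DataFrame({'hr': [96, 98, 85, 84], 'temp': [38, 38, 36, 43]}, index=index)

# Step 1: evaluate the criterion once for the whole column (one bool per row).
row_ok = data['temp'] >= 37

# Step 2: reduce each year with all() and broadcast the result back to
# every row of that year, so the mask has the same length as the frame.
year_ok = row_ok.groupby(level='yr').transform('all')

# The broadcast mask can be used directly for boolean row selection.
result = data.loc[year_ok]
print(result)
```

Note that row_ok is True for the second 2014 visit, while year_ok is False for both 2014 rows; the broadcast step is what removes the whole year.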

Here is an example showing the difference in speed for a 1000-row DataFrame:

In [70]: df = pd.DataFrame(np.random.randint(100, size=(1000, 4)), columns=list('ABCD')).set_index(['A','B'])

In [71]: %timeit df.groupby(level='A').filter(lambda x: (x['C']>=5).all())
10 loops, best of 3: 46.3 ms per loop

In [72]: %timeit df.loc[(df['C']>=37).groupby(level='A').transform('all')]
100 loops, best of 3: 18.9 ms per loop
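Outside IPython, a comparison along these lines could be run with the standard timeit module. The sketch below is not the benchmark from the answer: it uses a seeded random generator and the same threshold (5) for both approaches so the comparison is like for like, and it first checks that the two approaches return the same rows. The function names are hypothetical, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the data is reproducible
df = pd.DataFrame(rng.integers(100, size=(1000, 4)),
                  columns=list('ABCD')).set_index(['A', 'B'])

def with_filter(frame):
    # Calls the lambda once per group.
    return frame.groupby(level='A').filter(lambda x: (x['C'] >= 5).all())

def with_transform(frame):
    # Evaluates the criterion once, then broadcasts the per-group result.
    return frame.loc[(frame['C'] >= 5).groupby(level='A').transform('all')]

# Both approaches should select exactly the same rows.
same = with_filter(df).equals(with_transform(df))
print('identical results:', same)

t_filter = timeit.timeit(lambda: with_filter(df), number=10)
t_transform = timeit.timeit(lambda: with_transform(df), number=10)
print(f'filter:    {t_filter / 10 * 1000:.1f} ms per call')
print(f'transform: {t_transform / 10 * 1000:.1f} ms per call')
```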

Solution 2:

Using .loc:

import pandas as pd

index = pd.MultiIndex.from_product(
  [[2013, 2014], [1, 2]], names=['yr', 'visit'])

columns = pd.MultiIndex.from_product([['hr', 'temp']], names=['metric'])

data = pd.DataFrame([[96, 38], [98, 38], [85, 36], [84, 43]], 
                    index=index, columns=columns)

data.loc[[2013]]

Gives:

metric      hr  temp
yr   visit
2013 1      96    38
     2      98    38
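Note that data.loc[[2013]] names the year to keep by hand. If the years should instead be derived from the temp condition, one possible sketch is shown below. It assumes flat column labels, and the names yrs, bad and good are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['yr', 'visit'])
# Assumption: flat columns instead of the MultiIndex columns in the answer.
data = pd.DataFrame({'hr': [96, 98, 85, 84], 'temp': [38, 38, 36, 43]}, index=index)

# Year label of every row.
yrs = data.index.get_level_values('yr')

# Years that contain at least one visit with temp below 37.
bad = yrs[(data['temp'] < 37).to_numpy()].unique()

# All remaining years are the ones to keep.
good = yrs.unique().difference(bad)

# Same selection style as above, but with the list of years computed.
result = data.loc[good.tolist()]
print(result)
```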
