
Read Dataframe Split By Nan Rows And Extract Specific Columns In Python

I have an example Excel file, data2.xlsx, from here, which has a Sheet1 as follows: Preprocess: The columns 2018, 2019, 2020, num are object type, which I need to convert to float: c

Solution 1:

Note: I use column indices where the column names are not certain.

You can split tables with

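# label every row with the first non-null value of its block (the table title)
first_col = df.iloc[:, 0]
df['city'] = first_col.groupby(first_col.isna().cumsum()).transform('first')
# drop the empty separator rows, then the title rows themselves
df.dropna(subset=[df.columns[0]], inplace=True)
df = df.loc[df[df.columns[0]] != df.city]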

Now df will have an additional column city holding the table title, while the title rows and empty rows have been discarded. You can extract any part of that title with .str.split(...).str.get(...):

df.city = df.city.str.split('-').str.get(1)
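To see the block-labelling idea on its own, here is a minimal sketch. It assumes a small hypothetical frame (raw) laid out like a sheet read with header=None: a title row, a header row, data rows, then an empty separator row. The names raw, block_id and tables are made up for illustration, and the real sheet has more columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet (not the real data2.xlsx):
# title row, header row, data rows, empty separator row, next table.
raw = pd.DataFrame(
    [
        ["2019-bj-price-quantity", np.nan],
        ["year", "num"],
        ["price", 21.0],
        ["quantity", 10.0],
        [np.nan, np.nan],
        ["2019-gz-price-quantity", np.nan],
        ["year", "num"],
        ["price", 6.0],
        ["quantity", 15.0],
    ]
)

first_col = raw.iloc[:, 0]
# Each empty row bumps the counter, so every table gets its own block id.
block_id = first_col.isna().cumsum()
# 'first' skips NaN, so each block is labelled with its title string.
raw["city"] = first_col.groupby(block_id).transform("first")

# Drop the separator rows, then the title rows themselves.
tables = raw.dropna(subset=[0])
tables = tables[tables[0] != tables["city"]].copy()

# Keep only the middle part of the title, e.g. 'bj'.
tables["city"] = tables["city"].str.split("-").str.get(1)
print(tables)
```

The header rows ('year', 'num') are still present at this point; they can be filtered out the same way if they get in the way later.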

Finally, you want to keep just the label column, the num column and city, which is the easiest step, and pivot so that each city becomes one row:

df = df.iloc[:, [0, 4, 5]]
df = df.pivot(index='city', columns=df.columns[0], values=df.columns[1])
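The pivot step can be tried in isolation. This is a sketch on a hypothetical long-format frame (long_df) with named columns; the real frame uses positional column labels, as in the lines above.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (city, label) pair.
long_df = pd.DataFrame(
    {
        "city": ["bj", "bj", "gz", "gz"],
        "year": ["price", "quantity", "price", "quantity"],
        "num": [21.0, 10.0, 6.0, 15.0],
    }
)

# One row per city, one column per label, cell values taken from 'num'.
wide = long_df.pivot(index="city", columns="year", values="num")
print(wide)
```

Note that pivot raises an error if the same (city, label) pair occurs more than once, so duplicates would need to be resolved first.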

Solution 2:

My code, based on jezrael's great answer. Feel free to share a better solution or improve it:

import numpy as np
import pandas as pd

# add header=None for default column names
df = pd.read_excel('./data2.xlsx', sheet_name='Sheet1', header=None)

# convert columns by second row
df.columns = df.iloc[1].rename(None)

# create new column `city` by forward filling non missing values by second column
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())

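# strip the title prefix and suffix; regex=True is needed because the pattern uses '|'
pattern = '|'.join(['2019-', '-price-quantity'])
df['city'] = df['city'].str.replace(pattern, '', regex=True)
df['year'] = df['year'].str.replace(pattern, '', regex=True)
# convert float column labels to integers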
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]
df = df[df.year.isin(['price', 'quantity'])]
df = df[['city', 'year', 'num']]
df['num'] = df['num'].replace('--', np.nan, regex=True).astype(float)
df = df.set_index(['city', 'year']).unstack().reset_index()
df.columns = df.columns.droplevel(0)
df.rename({'year': 'city'}, axis=1, inplace=True)
print(df)

Out:

year      price  quantity
0     bj   21.0      10.0
1     gz    6.0      15.0
2     sh   12.0       NaN
3     sz   13.0       NaN
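The mask-and-forward-fill step used for the city column in Solution 2 can also be looked at on its own. A minimal sketch, assuming a hypothetical two-column frame (raw) in which title rows are the only rows that have a value in the first column but nothing in the second:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet read with header=None.
raw = pd.DataFrame(
    {
        0: ["2019-bj-price-quantity", "year", "price", "quantity", np.nan,
            "2019-gz-price-quantity", "year", "price"],
        1: [np.nan, "num", 21.0, 10.0, np.nan, np.nan, "num", 6.0],
    }
)

# Blank out column 0 wherever column 1 has a value: only the titles survive.
titles = raw[0].mask(raw[1].notna())
# Carry each title down to the rows of its own table.
city = titles.ffill()
print(city)
```

Compared with the groupby approach in Solution 1, this relies on the second column being empty in title rows rather than on the empty separator rows.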
