Read Dataframe Split By Nan Rows And Extract Specific Columns In Python
I have an example Excel file, data2.xlsx, from here, which has a Sheet1 as follows: Preprocess: The columns 2018, 2019, 2020, num are object type, which I need to convert to float: c
Solution 1:
Note: I use column indices where the column name is not certain.
You can split the tables with:
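# every empty row starts a new group; the first non-null value in each group is the table title
df['city'] = df.iloc[:, 0].groupby(df.iloc[:, 0].isna().cumsum()).transform('first')
# drop the empty rows, then the title rows themselves
df.dropna(subset=[df.columns[0]], inplace=True)
df = df.loc[df[df.columns[0]] != df.city]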
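As a rough illustration of the grouping step above, here is a minimal sketch on a made-up miniature of the sheet. The frame `raw`, its values, and the title format are assumptions chosen for the example, not the contents of data2.xlsx.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: title row, header row, data rows, blank row.
raw = pd.DataFrame({
    0: ["2019-bj-price-quantity", "year", "price", "quantity", np.nan,
        "2019-sh-price-quantity", "year", "price"],
    1: [np.nan, "num", 21, 10, np.nan, np.nan, "num", 12],
})

first_col = raw.iloc[:, 0]
# Each blank row bumps the counter, so every table gets its own group id.
group_id = first_col.isna().cumsum()
# "first" skips missing values, so it returns the table title for each group.
raw["city"] = first_col.groupby(group_id).transform("first")

# Drop the blank rows, then the title rows themselves.
tables = raw.dropna(subset=[raw.columns[0]])
tables = tables.loc[tables.iloc[:, 0] != tables["city"]]
print(tables)
```

The blank row belongs to the group of the table that follows it, which is harmless because it is dropped right afterwards.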
Now df will have an additional column, city, holding the table title, while the title rows and empty rows have been discarded. You can extract any part of that city column with .str.split(...).str.get(...):
df.city = df.city.str.split('-').str.get(1)
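To make the indexing concrete, here is a small sketch of what `.str.split('-').str.get(1)` does. The title strings are assumed to look like `2019-bj-price-quantity`, which is inferred from the pattern used in Solution 2 rather than confirmed by the sheet.

```python
import pandas as pd

# Assumed title format: "<year>-<city>-price-quantity".
titles = pd.Series(["2019-bj-price-quantity", "2019-sh-price-quantity"])

# split gives a list per row; get(1) picks the second piece, the city code.
codes = titles.str.split("-").str.get(1)
print(codes.tolist())  # ['bj', 'sh']
```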
Finally, you want to keep just the num column, which is the easiest step:
df = df.iloc[:, [0, 4, 5]]
df = df.pivot(index='city', columns=df.columns[0], values=df.columns[1])
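The pivot step can be sketched on a small long-format frame. The frame `long` and its numbers are made up for illustration; only the column layout (year, num, city) follows the steps above.

```python
import pandas as pd

# Hypothetical long-format data after the earlier steps.
long = pd.DataFrame({
    "year": ["price", "quantity", "price"],
    "num": [21.0, 10.0, 12.0],
    "city": ["bj", "bj", "sh"],
})

# One row per city, one column per distinct value in "year".
wide = long.pivot(index="city", columns="year", values="num")
print(wide)
```

A city with no quantity row simply gets NaN in that column. Note that pivot raises an error if a (city, year) pair occurs more than once.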
Solution 2:
My code is based on jezrael's great answer. You are welcome to share a better solution or improve it:
import numpy as np
import pandas as pd
# add header=None for default column names
df = pd.read_excel('./data2.xlsx', sheet_name='Sheet1', header=None)
# convert columns by second row
df.columns = df.iloc[1].rename(None)
# create new column `city` by forward filling non missing values by second column
df.insert(0, 'city', df.iloc[:, 0].mask(df.iloc[:, 1].notna()).ffill())
pattern = '|'.join(['2019-', '-price-quantity'])
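# regex=True is required in recent pandas, where str.replace treats the pattern literally by default
df['city'] = df['city'].str.replace(pattern, '', regex=True)
df['year'] = df['year'].str.replace(pattern, '', regex=True)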
# convert floats to integers
df.columns = [int(x) if isinstance(x, float) else x for x in df.columns]
df = df[df.year.isin(['price', 'quantity'])]
df = df[['city', 'year', 'num']]
df['num'] = df['num'].replace('--', np.nan, regex=True).astype(float)
df = df.set_index(['city', 'year']).unstack().reset_index()
df.columns = df.columns.droplevel(0)
df.rename({'year': 'city'}, axis=1, inplace=True)
print(df)
Out:
year     price  quantity
0    bj   21.0      10.0
1    gz    6.0      15.0
2    sh   12.0       NaN
3    sz   13.0       NaN
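Two steps in Solution 2 are easy to misread: the mask plus ffill that builds the city column, and the conversion of '--' to NaN. Here is a minimal sketch of both on a made-up frame. The values and the '--' placement are assumptions for illustration, not the real sheet.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sheet as read with header=None.
df = pd.DataFrame({
    0: ["2019-bj-price-quantity", "year", "price", "quantity",
        "2019-sh-price-quantity", "year", "price", "quantity"],
    1: [np.nan, "num", "21", "10", np.nan, "num", "12", "--"],
})

# Keep column 0 only where column 1 is missing (the title rows), then
# forward fill so every row carries the title of its own table.
city = df[0].mask(df[1].notna()).ffill()

# Turn the '--' placeholder into NaN before casting the data rows to float.
is_data = df[0].isin(["price", "quantity"])
num = df.loc[is_data, 1].replace("--", np.nan).astype(float)

print(city.tolist())
print(num.tolist())
```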