Appending Blank Rows To Dataframe If Column Does Not Exist
This question is kind of odd and complex, so bear with me, please. I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data co
Solution 1:
I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.
Assuming some kind of master list all_cols_to_use, can you do something like:
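import numpy as np
import pandas as pd

def parse_big_csv(csvpath):
    # Read only the header line to see which columns this file has.
    with open(csvpath, 'r') as infile:
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Add each column the file lacks, filled with NaN.
    for col in missing_cols:
        df[col] = np.nan
    return df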
This assumes that you're okay with filling the missing columns with np.nan, but should work. (Also, if you’re concatenating the data frames, the missing columns will be in the final df and filled with np.nan as appropriate.)
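To illustrate the concatenation point, here is a minimal sketch using two tiny in-memory CSVs. The sample data and the column names (id, temp, pressure) are made up for the example, and all_cols_to_use stands in for whatever master list you actually have. It relies on pd.concat aligning frames by column name and on DataFrame.reindex to put the columns in the master order.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for two CSV dumps with different columns.
csv_a = io.StringIO("id,temp\n1,20.5\n2,21.0\n")
csv_b = io.StringIO("id,pressure\n3,101.3\n")

# Assumed master list of columns (not from the original question).
all_cols_to_use = ["id", "temp", "pressure"]

frames = [pd.read_csv(csv_a), pd.read_csv(csv_b)]

# concat aligns on column names; a column absent from one frame
# is filled with NaN for that frame's rows.
combined = pd.concat(frames, ignore_index=True, sort=False)

# reindex enforces the master column order, and would also add any
# column that no file contained, filled with NaN.
combined = combined.reindex(columns=all_cols_to_use)

print(combined)
```

If every file goes through concat anyway, this may make adding the missing columns per file unnecessary, though reading only the needed columns with usecols is still worthwhile for multi-GB files.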