Appending Blank Rows To Dataframe If Column Does Not Exist
This question is kind of odd and complex, so bear with me, please. I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data co
Solution 1:
I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.
Assuming you have some kind of master list, all_cols_to_use, could you do something like this?
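import numpy as np
import pandas as pd

def parse_big_csv(csvpath):
    # Read only the header line to see which columns this file has.
    # all_cols_to_use is the master list, defined elsewhere.
    with open(csvpath, 'r') as infile:
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Add each missing column, filled with NaN.
    for col in missing_cols:
        df[col] = np.nan
    return df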
This assumes that you're okay with filling the missing columns with np.nan, but it should work. (Also, if you're concatenating the data frames, any column that is missing from one frame but present in another will still appear in the final df, filled with np.nan for the rows that lacked it.)
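To illustrate the concatenation point, here is a minimal sketch. The two in-memory CSVs and the three-column master list are made up for the example and are not from the question; real code would pass file paths instead. It shows pd.concat aligning frames on column labels, followed by reindex to enforce the master column order.

```python
import io
import pandas as pd

# Hypothetical master list and two tiny stand-in "files" with different headers.
all_cols_to_use = ["a", "b", "c"]
csv_one = io.StringIO("a,b\n1,2\n3,4\n")
csv_two = io.StringIO("b,c\n5,6\n")

frames = [pd.read_csv(csv_one), pd.read_csv(csv_two)]

# concat aligns on column labels; a column absent from one frame
# is filled with NaN for that frame's rows.
combined = pd.concat(frames, ignore_index=True, sort=False)

# reindex puts the columns in master-list order and would also add,
# as all-NaN, any master column that none of the files contained.
combined = combined.reindex(columns=all_cols_to_use)
print(combined)
```

If every file is going to be concatenated anyway, this may make the per-file step of adding missing columns unnecessary, though restricting usecols up front still avoids loading columns you don't need.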