
Appending Blank Rows To Dataframe If Column Does Not Exist

This question is kind of odd and complex, so bear with me, please. I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data co

Solution 1:

I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.

Assuming some kind of master list all_cols_to_use, can you do something like:

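import pandas as pd

def parse_big_csv(csvpath, all_cols_to_use):
    # Read only the header row to learn which columns this file has.
    header = pd.read_csv(csvpath, nrows=0).columns
    # Keep the wanted columns that are actually present in this file.
    cols_to_use = [c for c in all_cols_to_use if c in header]
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # reindex adds every missing column, filled with NaN,
    # and puts the columns in master-list order.
    return df.reindex(columns=all_cols_to_use)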

This assumes you are okay with the missing columns being filled with np.nan. (Also, if you are concatenating the data frames, pd.concat takes the union of the columns, so a column absent from one file still appears in the final df, filled with np.nan for that file's rows.)
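
The concatenation behaviour mentioned above can be checked on a small case. This is only a sketch: the two frames and the column names a, b and c are made up for illustration and stand in for CSV dumps with different headers.

```python
import pandas as pd

# Two hypothetical frames standing in for CSV dumps with different columns.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5], "c": [6]})

# concat uses the union of the columns; cells with no source data become NaN.
combined = pd.concat([df1, df2], ignore_index=True, sort=False)
print(combined)
```

If concatenation is the end goal, this may be enough by itself, without padding each frame beforehand. Padding each frame up front is still useful when every frame needs the same columns before any further per-file processing.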
