Pandas read_csv() 1.2GB File Out of Memory on VM with 140GB RAM
Solution 1:
This sounds like a job for chunksize. It splits the reading of the file into multiple chunks, reducing the amount of memory needed at any one time.
import pandas as pd

df = pd.DataFrame()
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)
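Note that concatenating inside the loop copies the growing DataFrame on every iteration; it is usually cheaper to collect the chunks in a list and concatenate once at the end. A minimal sketch of that variant, reusing the same file name and chunk size as above:

import pandas as pd

chunks = []
for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    chunks.append(chunk)                       # keep each chunk; no repeated copying of the full frame
df = pd.concat(chunks, ignore_index=True)      # single concatenation at the end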
Solution 2:
This error can occur with an invalid csv file, rather than the stated memory error.
I got this error with a file that was much smaller than my available RAM and it turned out that there was an opening double quote on one line without a closing double quote.
In this case, you can check the data, or change the quoting behavior of the parser, for example by passing quoting=3 (csv.QUOTE_NONE) to pd.read_csv.
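A minimal sketch of that workaround (the file name is just a placeholder); quoting=3 is the numeric value of csv.QUOTE_NONE:

import csv
import pandas as pd

# csv.QUOTE_NONE (i.e. quoting=3) disables special handling of quote characters,
# so an unbalanced double quote no longer makes the parser buffer the rest of the file.
df = pd.read_csv('data.csv', quoting=csv.QUOTE_NONE)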
Solution 3:
This is weird, but I ran into the same situation with a plain
df_train = pd.read_csv('./train_set.csv')
After trying a lot of things to get around the error, the following worked. Specifying the dtypes explicitly:
import numpy as np
import pandas as pd

dtypes = {'id': np.int8,
          'article': str,
          'word_seg': str,
          'class': np.int8}
df_train = pd.read_csv('./train_set.csv', dtype=dtypes)
df_test = pd.read_csv('./test_set.csv', dtype=dtypes)
Or this:
ChunkSize = 10000
i = 1
for chunk in pd.read_csv('./train_set.csv', chunksize=ChunkSize):  # read and merge in chunks
    df_train = chunk if i == 1 else pd.concat([df_train, chunk])
    print('-->Read Chunk...', i)
    i += 1
But then, suddenly, the original version worked fine as well! So it feels like I did a lot of useless work, and I still have no idea what actually went wrong. I don't know what to say.
Solution 4:
You can use df.info(memory_usage="deep") to find out how much memory the data loaded into the DataFrame is actually using.
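A minimal sketch of that check (the file name is only a placeholder):

import pandas as pd

df = pd.read_csv('file.csv')
df.info(memory_usage="deep")          # "deep" also measures the strings inside object columns
print(df.memory_usage(deep=True))     # per-column breakdown in bytes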
A few things to reduce memory (see the sketch below):
- Only load the columns you need via usecols.
- Set dtypes for those columns.
- If some columns have an object/string dtype, try dtype="category"; in my experience it reduced the memory usage drastically.
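A minimal sketch combining the three tips; the file name, column names and dtypes are only assumptions for illustration:

import numpy as np
import pandas as pd

# Hypothetical file and columns, chosen only to illustrate the tips above.
df = pd.read_csv(
    'file.csv',
    usecols=['id', 'price', 'status'],          # 1. load only the columns you need
    dtype={'id': np.int32,                      # 2. set explicit dtypes for those columns
           'price': np.float32,
           'status': 'category'},               # 3. low-cardinality strings as category
)
df.info(memory_usage="deep")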
Solution 5:
I used the code below to load the CSV in chunks, deleting each intermediate chunk to keep memory in check and printing the percentage loaded in real time. Note: 96817414 is the number of rows in my CSV.
import pandas as pd
import gc

cols = ['col_name_1', 'col_name_2', 'col_name_3']
df = pd.DataFrame()
i = 0
for chunk in pd.read_csv('file.csv', chunksize=100000, usecols=cols):
    df = pd.concat([df, chunk], ignore_index=True)
    del chunk
    gc.collect()                      # free the chunk before reading the next one
    i += 1
    if i % 5 == 0:
        print("% of read completed", 100 * (i * 100000 / 96817414))