How To Customize Pytorch Data
I am trying to make a customized DataLoader using PyTorch. I've seen some code like the following (class definition omitted, sorry): def __init__(self, data_root, transform=None, training=True, return_
Solution 1:
First, you want to customize (overload) data.Dataset, not data.DataLoader, which is perfectly fine for your use case.
What you can do, instead of loading all the data into RAM, is to read and store "metadata" in __init__, and then read only the one relevant csv file whenever __getitem__ needs a specific entry.
Pseudo-code for your Dataset would look something like:
class ManyCSVsDataset(data.Dataset):
    def __init__(self, ...):
        super(ManyCSVsDataset, self).__init__()
        # store the paths to all csv files and the number of items in each one
        self.metadata = ...
        self.num_items = total_number_of_items

    def __len__(self):
        return self.num_items

    def __getitem__(self, index):
        # based on the index, use self.metadata to determine which csv file to open
        with open(relevant_csv_file, 'r') as R:
            # read from R the specific line matching item index
            return item
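To make the idea above concrete, here is a minimal runnable sketch of the same metadata approach (the class and parameter names are illustrative, not from the original question). It uses only the standard csv module; in real code the class would subclass torch.utils.data.Dataset, which just requires __len__ and __getitem__, so plain Python is used here to keep the sketch self-contained:

```python
import csv

class ManyCSVsDataset:
    # In practice: class ManyCSVsDataset(torch.utils.data.Dataset)
    def __init__(self, csv_paths):
        # Scan each file once, recording only (path, row_count) --
        # the rows themselves are never held in RAM.
        self.metadata = []
        self.num_items = 0
        for path in csv_paths:
            with open(path, 'r', newline='') as f:
                n = sum(1 for _ in csv.reader(f))
            self.metadata.append((path, n))
            self.num_items += n

    def __len__(self):
        return self.num_items

    def __getitem__(self, index):
        # Walk the metadata to find which file holds row `index`,
        # then open just that file and fetch the matching row.
        for path, n in self.metadata:
            if index < n:
                with open(path, 'r', newline='') as f:
                    for i, row in enumerate(csv.reader(f)):
                        if i == index:
                            return row
            index -= n
        raise IndexError(index)
```

Note that __getitem__ re-scans the target file line by line on every call; for large files you could instead record byte offsets per row in __init__ and seek directly.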
This implementation is not efficient in the sense that it reads the same csv file over and over and caches nothing. On the other hand, you can take advantage of data.DataLoader's multiprocessing support to have many parallel sub-processes doing all this file access in the background while you actually use the data for training.
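Assuming torch is installed, wrapping such a Dataset in a DataLoader with num_workers > 0 is what moves the per-item file reads into background worker processes; the helper name and parameter values below are illustrative, not part of the original answer:

```python
def make_loader(dataset, batch_size=32, num_workers=4):
    """Wrap a Dataset so that `num_workers` background processes each call
    dataset.__getitem__ independently, overlapping the per-item CSV reads
    with the training loop that consumes the batches."""
    # Imported lazily so this sketch parses even without torch installed.
    from torch.utils.data import DataLoader
    return DataLoader(dataset, batch_size=batch_size,
                      num_workers=num_workers, shuffle=True)
```

Each worker holds its own copy of the Dataset's metadata, so the workers can open different csv files concurrently without coordinating.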