
Memory-Efficient LDA Training Using the Gensim Library

I have just started writing a script that trains LDA models on large corpora (at minimum 30M sentences) using the gensim library. Here is the current code that I am using: from gensim

Solution 1:

Consider wrapping your corpus up as an iterable and passing that instead of a list. A generator will not work, because gensim needs to iterate over the corpus more than once, and a generator is exhausted after a single pass.

From the tutorial:

class MyCorpus(object):
    def __iter__(self):
        for line in open(fname):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
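
The snippet above assumes that fname and dictionary already exist. As a minimal sketch (assuming a hypothetical file corpus.txt with one whitespace-separated document per line), the dictionary itself can also be built with a single streaming pass over the file, so only the vocabulary, not the documents, is ever held in memory:

import gensim

fname = "corpus.txt"

# One streaming pass over the file; Dictionary only needs to see
# each token list once, so a generator expression is fine here.
dictionary = gensim.corpora.Dictionary(
    line.lower().split() for line in open(fname)
)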

Additionally, Gensim has several different corpus formats readily available, which can be found in the API reference. You might consider using TextCorpus, which should fit your format nicely already:

corpus = gensim.corpora.TextCorpus(fname)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, 
                                      id2word=corpus.dictionary,  # TextCorpus can build the dictionary for you
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
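
Once training finishes, the model can be persisted to disk and inspected; a short usage sketch (the file name lda.model is just an example):

# Save the trained model and reload it later without retraining.
lda.save("lda.model")
lda = gensim.models.ldamodel.LdaModel.load("lda.model")

# Inspect the most significant words in the first 10 topics.
for topic in lda.print_topics(10):
    print(topic)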
