
Retrieve String Version Of Document By Id In Gensim

I am using Gensim for some topic modelling and I have gotten to the point where I am doing similarity queries using the LSI and tf-idf models. I get back the set of IDs and similar

Solution 1:

Sadly, as far as I can tell, you have to start from the very beginning of the analysis knowing that you'll want to retrieve documents by the ids. This means you need to create your own mapping between ids and the original documents and make sure the ids gensim uses are preserved throughout the process. As is, I don't think gensim keeps such a mapping handy.

I could definitely be wrong, and in fact I'd love it if someone tells me there is an easier way, but I spent many hours trying to avoid re-running a gigantic LSI model on a wikipedia corpus to no avail. Eventually I had to carry along a list of ids and the associated documents so I could use gensim's output.
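
The bookkeeping described above can be sketched roughly as follows. This is only an illustration, not gensim API: `doc_ids`, `documents` and `lookup` are hypothetical names, and `sims` is a hand-written stand-in for the (position, score) pairs a similarity query gives back. The one assumption is that the documents are fed to gensim in the same order as these lists.

```python
# Keep ids and raw texts in the same order the documents are streamed to gensim,
# so position N in gensim's output refers to entry N in these lists.
doc_ids = ["art-001", "art-002", "art-003"]  # hypothetical original ids
documents = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
    "Graph minors: a survey",
]


def lookup(sims, top_n=2):
    """Map (position, score) pairs back to (original id, text, score)."""
    ranked = sorted(sims, key=lambda item: -item[1])
    return [(doc_ids[pos], documents[pos], score) for pos, score in ranked[:top_n]]


# Stand-in for list(enumerate(index[vec_lsi])); real scores would come from gensim.
sims = [(0, 0.12), (1, 0.87), (2, 0.45)]
results = lookup(sims)
for original_id, text, score in results:
    print(original_id, round(score, 2), text)
```

If the corpus is filtered or reordered at any point, the two lists have to be filtered or reordered in exactly the same way, otherwise the positions no longer line up.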

Solution 2:

I have just gone through the same process and reached the same point of having "sims" with a document ID but wanting my original "article code". Although it's not provided entirely, there is a metadata feature throughout the Gensim library and the examples which can help. I'll answer this while I remember what I had to do, in case it helps any future visitors to this old question.

See gensim.corpora.textcorpus.TextCorpus#get_texts, which yields either just the tokens or, if the metadata flag is enabled, the tokens together with a single item of metadata, the line number:

def get_texts(self):
    """Iterate over the collection, yielding one document at a time. A document
    is a sequence of words (strings) that can be fed into `Dictionary.doc2bow`.
    Each document will be fed through `preprocess_text`. That method should be
    overridden to provide different preprocessing steps. This method will need
    to be overridden if the metadata you'd like to yield differs from the line
    number.
    Returns:
        generator of lists of tokens (strings); each list corresponds to a preprocessed
        document from the corpus `input`.
    """
    lines = self.getstream()
    if self.metadata:
        for lineno, line in enumerate(lines):
            yield self.preprocess_text(line), (lineno,)
    else:
        for line in lines:
            yield self.preprocess_text(line)

I had already implemented a custom make_corpus.py script, and a trial classifier script which uses similarity to find related documents to a search document. The changes I made to utilise the metadata from that point were as follows:

In the make_corpus script, I enabled metadata in the constructor to my TextCorpus daughter class:

corpus = SysRevArticleCorpus(inp, lemmatize=lemmatize, metadata=True)

I also needed to serialise the metadata, as I'm not doing the processing immediately after corpus generation (as some of the examples do), so you need to turn on metadata in the serialise step too:

MmCorpus.serialize(outp + '_bow.mm', corpus, progress_cnt=10000, metadata=True)

This makes gensim.matutils.MmWriter#write_corpus save a “xxx_bow.mm.metadata.cpickle” file with your corpus .mm files.
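
As noted further down in this answer, that file is just a pickled dictionary keyed by the Gensim-generated document ID. The sketch below writes and reloads such a dictionary with the standard library only, so it runs without gensim; the sample entries and the `load_metadata` helper are made up for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for what the serialise step saves: {document number: metadata tuple}.
docno2metadata = {
    0: ("CODE-A", "First article title"),
    1: ("CODE-B", "Second article title"),
}


def load_metadata(path):
    """Load the pickled docno -> metadata mapping, closing the file afterwards."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "xxx_bow.mm.metadata.cpickle")
    # Write the stand-in file, then read it back the way a later script would.
    with open(path, "wb") as handle:
        pickle.dump(docno2metadata, handle)
    metadata = load_metadata(path)

code, title = metadata[1]
print(code, title)
```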

To add more items into the metadata, you need to implement and override a few things in a TextCorpus daughter class. I already had based one off the WikiCorpus example class, as I have my own existing corpus to read.

The constructor needs to receive the metadata flag e.g.:

def __init__(self, fname, processes=None, lemmatize=utils.has_pattern(), 
    dictionary=None, metadata=False,
...
    self.metadata = metadata

    if dictionary is None:
        # temporarily disable metadata to make internal dict
        metadata_setting = self.metadata
        self.metadata = False
        self.dictionary = Dictionary(self.get_texts())
        self.metadata = metadata_setting
    else:
        self.dictionary = dictionary

I'm actually reading in from a JSON corpus so I'd already written a custom parser. My articles have a "code" property which is my canonical document ID. I also want to store the "title", and the document body is in the "text" property. (This replaces the XML parsing in the wiki example).

def extract_articles(f, filter_namespaces=False):
    """
    Extract article from a SYSREV article export JSON = open file-like object `f`.

    Return an iterable over (str, str, str) which generates (title, content, pageid) triplets.
    """
    elems = (elem for elem in f)
    for elem in elems:
        yield elem["title"], elem["text"] or "", elem["code"]

This is called from within the overridden get_texts (in the parent class it mentions you need to override this to use custom metadata). Summarised:

def get_texts(self):
...
    with open(self.fname) as data_file:
        corpusdata = json.load(data_file)
    texts = \
        ((text, self.lemmatize, title, pageid)
         for title, text, pageid
         in extract_articles(corpusdata['docs'], self.filter_namespaces))

... (skipping pool processing stuff for clarity)

    for tokens, title, pageid in pool.imap(process_article, group):

        if self.metadata:
            yield (tokens, (pageid, title))
        else:
            yield tokens

So this should get you saving metadata alongside your corpus.mm files. When you want to re-read this in a later script, you will need to read the pickle file back in yourself; there doesn't seem to be a built-in method for re-reading the metadata. Fortunately it's just a dictionary indexed by the Gensim-generated document ID, so it's easy to load and use. (See wiki-sim-search)

e.g. in my trial classifier, I just added two things: metadata = pickle.load() and metadata[docID] to finally find the original article.

# re-load everything...
dictionary = corpora.Dictionary.load_from_text(datapath+'/en_wordids.txt')
corpus = corpora.MmCorpus(datapath + '/xxx_bow.mm')
metadata = pickle.load(open(datapath + '/xxx_bow.mm.metadata.cpickle', 'rb'))

lsiModel = models.LsiModel(corpus, id2word=dictionary, num_topics=4)
index = similarities.MatrixSimilarity(lsiModel[corpus])

# example search
doc = "electronic cognitive simulation"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsiModel[vec_bow]  # convert the query to LSI space

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# Look up the original article metadata for the top hit
(docID, prob) = sims[0]
print(metadata[docID])

# Prints (CODE, TITLE)
('ShiShani2008ProCarNur', 'Jordanian nurses and physicians learning needs for promoting smoking cessation.')

I know this doesn't provide the original text as you asked (I don't need it myself), but you could very easily add the text to the "metadata" (although this rather stretches the definition of metadata and could be very big!). I guess Gensim presumes you will already have some database of your original documents, and therefore it would be out of scope. However I feel there needs to be a mapping between the Gensim-generated IDs and the original document identifiers, which the metadata feature fulfils quite well.
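
If you would rather not put the full text into the metadata, one option is to keep it in a separate lookup keyed by the canonical code and go through the metadata to reach it. The following is only a sketch under assumptions: the inline JSON is a tiny stand-in for my export (same "docs" layout with "code", "title" and "text"), and `articles_by_code` and `original_text` are hypothetical names.

```python
import json

# Tiny stand-in for the JSON export; a real file would be read with json.load().
export = json.loads("""
{"docs": [
  {"code": "CODE-A", "title": "First article", "text": "Body of the first article."},
  {"code": "CODE-B", "title": "Second article", "text": null}
]}
""")

# Index the original text by canonical code, once, outside of gensim.
# Missing bodies become empty strings.
articles_by_code = {doc["code"]: doc["text"] or "" for doc in export["docs"]}

# Stand-in for the unpickled metadata: gensim document ID -> (code, title).
metadata = {0: ("CODE-A", "First article"), 1: ("CODE-B", "Second article")}


def original_text(doc_id):
    """Resolve a gensim document ID to the original article text."""
    code, _title = metadata[doc_id]
    return articles_by_code[code]


print(original_text(0))
```

This keeps the pickle small, at the cost of having to re-read the original export (or query whatever database holds the documents) in the script that runs the similarity query.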
