Can Python XML ElementTree Parse a Very Large XML File?

April 22, 2023

I'm trying to parse a large file (> 2 GB) of structured markup data, and there is not enough memory for this. Which XML parsing class is the best choice in this situation? More de

Solution 1:

Check out the iterparse() function. A description of how you can use it to parse very large documents can be found here.
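As a minimal sketch of the idea, the snippet below streams a document with xml.etree.ElementTree.iterparse() and clears each element once it has been handled. The count_tags helper and the tiny in-memory sample are made up for illustration; a real file path can be passed in the same way.

```python
import io
import xml.etree.ElementTree as ET


def count_tags(source):
    """Count elements per tag without building the whole tree first."""
    counts = {}
    # By default iterparse yields only "end" events, fired once an
    # element and all of its children have been read.
    for _event, elem in ET.iterparse(source):
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
        # Drop the element's children, text and attributes once processed.
        elem.clear()
    return counts


# Small in-memory stand-in for a very large file (hypothetical data).
sample = io.BytesIO(b"<root><item>a</item><item>b</item><other/></root>")
print(count_tags(sample))
```

Note that elem.clear() empties each element, but the emptied elements stay attached to their parent, so memory use still grows slowly on very long documents.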

Solution 2:

Most DOM-style libraries, ElementTree included, build the entire document model in memory. Traditionally, when your model is too large to fit into memory at once, you need a more stream-oriented parser such as xml.sax.

This is often harder than you would expect, especially when you are used to higher-order operations such as working with the entire DOM at once.

Is it possible that your XML document is rather simple, like

<entries>
  <entry>...</entry>
  <entry>...</entry>
</entries>

which would allow you to work on subsets of the data in a more ElementTree friendly manner?
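If the document really has that flat shape, one possible approach is to let iterparse() hand over one complete entry at a time and then clear the root, so finished entries do not pile up. This is only a sketch: the iter_entries helper, the id attribute and the sample data are assumptions, not part of the original question.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, tag="entry"):
    """Yield each completed <entry> element, then release it."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element (<entries>).
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Detach processed entries from the root so they can be
            # garbage-collected instead of accumulating.
            root.clear()


# Hypothetical sample standing in for a much larger file.
sample = io.BytesIO(
    b"<entries><entry id='1'>first</entry><entry id='2'>second</entry></entries>"
)
for entry in iter_entries(sample):
    print(entry.get("id"), entry.text)
```

Each yielded entry is an ordinary Element, so the usual find(), findall() and attribute access work on it before it is released.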

Solution 3:

The only API I've seen that can handle this sort of thing at all is pulldom:

http://docs.python.org/library/xml.dom.pulldom.html

Pulldom uses the SAX API to build partial DOM nodes; by pulling in specific sub-trees as a group and then discarding them when you're done, you can get the memory efficiency of SAX with the sanity of use of DOM.

It's an incomplete API; when I used it I had to modify it to make it fully usable, but it works as a foundation. I don't use it anymore, so I don't recall what I had to add; just an advance warning.

It's very slow.
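For reference, a minimal sketch of the pulldom pattern described above, using only the standard library. The iter_entry_texts helper, the entry tag name and the sample data are assumptions for illustration.

```python
import io
from xml.dom import Node, pulldom


def iter_entry_texts(source, tag="entry"):
    """Yield the text of each <entry>, expanding one subtree at a time."""
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Pull the rest of this element's subtree from the stream
            # into a small DOM fragment rooted at `node`.
            events.expandNode(node)
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == Node.TEXT_NODE
            )
            yield text.strip()


# Hypothetical sample standing in for a much larger file.
sample = io.BytesIO(b"<entries><entry>first</entry><entry>second</entry></entries>")
for text in iter_entry_texts(sample):
    print(text)
```

Only the subtree of the element passed to expandNode() is built as DOM nodes; everything else is seen as a stream of events.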

XML is a very poor format for handling large data sets. If you have any control over the source data, and if it makes sense for the data set, you're much better off breaking the data apart into smaller chunks that you can parse entirely into memory.

The other option is using SAX APIs, but they're a serious pain to do anything nontrivial with directly.
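To give a feel for the bookkeeping SAX requires, here is a small sketch that does nothing more than collect the text of each element with a given tag. The EntryTextHandler class, the entry tag and the sample data are hypothetical; the point is that even this simple task needs explicit state across three callbacks.

```python
import xml.sax


class EntryTextHandler(xml.sax.ContentHandler):
    """Collect the text content of every <entry> element."""

    def __init__(self, tag="entry"):
        super().__init__()
        self.tag = tag
        self.texts = []
        self._buffer = None  # None means "not inside an <entry>"

    def startElement(self, name, attrs):
        if name == self.tag:
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text run in several pieces, so accumulate.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.tag and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


handler = EntryTextHandler()
# Hypothetical sample standing in for a much larger file.
xml.sax.parseString(
    b"<entries><entry>first</entry><entry>second</entry></entries>", handler
)
print(handler.texts)
```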

Solution 4:

Yes. Ten years later, there are many new solutions for handling large files. Here is one I recommend.

For example, suppose the content of the file test.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>
        Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>
        Belgian waffles covered with assorted fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    ......
</breakfast_menu>

The solution using SimplifiedDoc is as follows:

from simplified_scrapy import SimplifiedDoc

doc = SimplifiedDoc()
doc.loadFile('test.xml', lineByline=True)

# Iterate over the <food> elements one at a time
for food in doc.getIterable('food'):
    print(food.children.text)

Result:

['Strawberry Belgian Waffles', '$7.95', 'Light Belgian waffles covered with strawberries and whipped cream', '900']
...

Solution 5:

As the other answers say, ElementTree is a DOM-style parser, although it does have an iterparse() method.

To reduce the memory footprint, I used a real SAX parser. Here is the link I used for my solution. Here's the official doc. Here's my XML:

<?xml version="1.0" encoding="UTF-8"?>
<metadata>
    <entity storageTableName="table7113" tableName="TableBusinessName">
        <attribute storageFieldName="field7114" fieldName="BusinessName1" />
        <attribute storageFieldName="field7115" fieldName="BusinessName2" />
        . . .
    </entity>
    . . .
</metadata>

Here's the code:

import xml.sax


class ModelNameHandler(xml.sax.ContentHandler):
    ENTITY_TAG = "entity"
    STORAGE_TABLE_NAME_ATTR = "storageTableName"
    TABLE_NAME_ATTR = "tableName"
    ATTRIBUTE_TAG = "attribute"
    STORAGE_FIELD_NAME_ATTR = "storageFieldName"
    FIELD_NAME_ATTR = "fieldName"

    def __init__(self):
        # Initialize the base ContentHandler state as well
        super().__init__()
        self.entity_code = None
        self.entity_names = {}
        self.attr_names = {}

    def startElement(self, tag, attributes):
        if tag == self.ENTITY_TAG:
            self.entity_code = attributes[self.STORAGE_TABLE_NAME_ATTR]
            entity_name = attributes[self.TABLE_NAME_ATTR]
            self.entity_names[self.entity_code] = entity_name
        elif tag == self.ATTRIBUTE_TAG:
            attr_code = attributes[self.STORAGE_FIELD_NAME_ATTR]
            key = self.entity_code + "." + attr_code
            attr_name = attributes[self.FIELD_NAME_ATTR]
            self.attr_names[key] = attr_name


def get_model_names(file):
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    handler = ModelNameHandler()
    parser.setContentHandler(handler)
    parser.parse(file)

    return handler.entity_names, handler.attr_names

It works fast enough.

Just in case, here are a few more details on how to call it:

import my_package as p


if __name__ == "__main__":

    with open('<my_path>/<my_file>.xml', 'r', encoding='utf_8') as file:
        entity_names, attr_names = p.get_model_names(file)
