Python: Iterative XML Parsing - Jan Švec

This article is only a short extension of the previous post on parsing a Wikipedia XML dump. I will show a piece of code that enables iterative XML loading with ElementTree, which is now part of the Python standard library.

Why Iterative Parsing?

Because even if you have a lot of memory, it is still finite. The 20 GB limit mentioned in the previous article is enough when the compressed dump is hundreds of MB, but some English Wikipedia dumps are several GB of compressed XML. A tree built from the whole file will not fit in ordinary memory at once, so the XML needs to be parsed iteratively. To avoid rewriting the code from the previous article completely, we will again use ElementTree, this time with its iterparse() function.

How to Do It

The iterparse() function is an interesting hybrid between SAX and the tree-oriented ElementTree. In practice, it can report that it has started building the subtree belonging to an element and that it has finished it. iterparse() returns an iterator that gradually yields (event, element) pairs. Elements should be processed on the end event, when their XML subtree has been fully built; on start, the element's text and children may not have been read yet.
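To make the order of events concrete, here is a minimal sketch that lists what iterparse() reports for a tiny document. The SAMPLE document and the list_events() helper are made up for illustration; they are not part of the dump-processing code.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-page document, used only for this illustration.
SAMPLE = b"<root><page><title>A</title></page><page><title>B</title></page></root>"

def list_events(xml_bytes):
    """Return (event, tag) pairs in the order iterparse() reports them."""
    stream = io.BytesIO(xml_bytes)
    # Ask for both 'start' and 'end' events explicitly.
    events = ET.iterparse(stream, events=("start", "end"))
    return [(event, elem.tag) for event, elem in events]

for event, tag in list_events(SAMPLE):
    print(event, tag)
```

Each element's end event arrives only after the end events of all its children, which is why end is the safe place to read an element's content.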

The XML tree can be modified during parsing, so elements that have already been processed can be removed from it, which saves a substantial amount of memory. See elem.clear() in the code below: it discards the element's children, text, and attributes, so in this case only an empty page element stays in the tree.
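The following sketch shows what is left behind after elem.clear(). The empty page elements stay attached to the root, so for a dump with millions of pages they still add up. A commonly used variation, which is my addition and not part of the code in this article, is to grab the root from the first start event and clear it instead. SAMPLE and remaining_children() are hypothetical names for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-page document, used only for this illustration.
SAMPLE = b"<root><page><title>A</title></page><page><title>B</title></page></root>"

def remaining_children(xml_bytes, clear_root=False):
    """Parse iteratively and report what is still attached to the root afterwards."""
    context = iter(ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "page":
            if clear_root:
                root.clear()   # drops finished pages from the root entirely
            else:
                elem.clear()   # empties the page but leaves its shell in the root
    return [(child.tag, len(child), child.text) for child in root]

print(remaining_children(SAMPLE))                   # empty page shells remain
print(remaining_children(SAMPLE, clear_root=True))  # nothing remains
```

Note that root.clear() also discards the root's own attributes and text, so this variation only fits when those are not needed later.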

And now the code:

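import bz2
import xml.etree.ElementTree as ET

# MW_NS is the MediaWiki namespace in braces, defined as in the previous article.

def parse_dump(xml_fn):
    """Yield (title, text) for each page of a bz2-compressed dump."""
    with bz2.BZ2File(xml_fn, 'r') as fr:
        for event, elem in ET.iterparse(fr):
            # On 'end' the whole page subtree has already been built.
            if event == 'end' and elem.tag == '{0}page'.format(MW_NS):
                text = elem.find('{0}revision/{0}text'.format(MW_NS))

                title = elem.find('{0}title'.format(MW_NS)).text

                yield title, text.text
                # Free the processed subtree; an empty page element remains.
                elem.clear()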

The interface is unchanged compared with the previous version of parse_dump(). Once a subtree has been built, find() queries work on it as before. Besides end, iterparse() can also report start, start-ns, and end-ns events, but these must be requested through its events argument; by default only end is reported.
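One possible use of the start-ns event is to read the namespace from the file itself instead of hard-coding it. The sketch below is an assumption of mine rather than something the previous article does; detect_default_namespace() is a hypothetical helper and the namespace URI in NS_SAMPLE is invented. For start-ns, the second item of each pair is a (prefix, uri) tuple rather than an element.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document; the namespace URI is made up for this illustration.
NS_SAMPLE = b'<mediawiki xmlns="http://example.org/ns"><page/></mediawiki>'

def detect_default_namespace(fileobj):
    """Return the default namespace as '{uri}', or None if there is none."""
    for event, payload in ET.iterparse(fileobj, events=("start-ns",)):
        prefix, uri = payload  # for 'start-ns' the payload is (prefix, uri)
        if prefix == "":
            # Same '{uri}' form that is prepended to tag names in find().
            return "{" + uri + "}"
    return None

print(detect_default_namespace(io.BytesIO(NS_SAMPLE)))
```

The returned string has the same shape as a braces-wrapped namespace prefix, so it could be formatted into tag names the same way MW_NS is.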

The code above relies on prior knowledge of the XML structure. If the XML contained recursively nested page elements, it would be necessary to track the path (the nesting of elements) to the currently processed element using the start and end events.
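A minimal sketch of such path tracking, assuming a made-up document with a page nested inside another page. NESTED and iter_paths() are hypothetical names; the idea is simply to push the tag on start and pop it on end.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document with a page nested inside another page.
NESTED = b"<root><page><title>outer</title><page><title>inner</title></page></page></root>"

def iter_paths(xml_bytes, tag):
    """Yield the path to every finished element with the given tag."""
    path = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            if elem.tag == tag:
                yield "/".join(path)
            path.pop()  # the element is closed, so leave its level

for p in iter_paths(NESTED, "page"):
    print(p)
```

The inner page finishes first, so its path is reported before the outer one. With the path at hand, the code can decide, for example, to process and clear only pages at a particular depth.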

Conclusion

The code shown here is meant as an initial push in the right direction, not as a complete guide to processing large XML files. For more complex behaviour, you will certainly find more on the web and in forums.