Python: Parsing Text from Wikipedia - Jan Švec

This note should be useful for anyone working in machine learning who needs to process large amounts of text from Wikipedia. I will show a few pieces of code that make it easier to start extracting clean text from MediaWiki pages.

Where to Get the Data

Wikipedia publishes its content in several formats; the Wikipedia download page describes them in detail. From there, it is easy to reach https://dumps.wikimedia.org/enwiki/latest/ or, for Czech Wikipedia, https://dumps.wikimedia.org/cswiki/latest/.

From there, XML dumps can be downloaded at several levels of detail, including titles, article texts, or text plus history. We need files in the form enwiki-latest-pages-articles*.xml-*.bz2, or analogous files for Czech Wikipedia with the cswiki-* prefix. Download the file or files and we can start processing them.

What Is Needed?

For processing we mainly need Python and the mwlib library. If your Python distribution has the pip command, installation is simple:

pip install mwlib

Add the --user option (pip install --user mwlib) if you want to install only for the current user.

Imports

As usual, we start with imports and constants that will be needed later:

import bz2
import re
import xml.etree.ElementTree as ET

from mwlib.parser import nodes
from mwlib.refine.compat import parse_txt

MW_NS = "{http://www.mediawiki.org/xml/export-0.9/}"

Parsing XML

The downloaded dump is XML compressed with bzip2. We can use Python’s ability to parse an open file with ElementTree. First, we open the bz2 file; in this article I use the Czech Wikipedia dump cswiki-latest-pages-articles.xml.bz2. Then we pass it to the ElementTree parser. Be careful: keeping the whole element tree in memory requires enough RAM. In my case it was on the order of 20 GB. Those who do not want to spend that much memory should use another parsing approach. UPDATE: More in Python: Iterative XML Parsing.

After that, we obtain the root element and use findall() to get the individual page elements representing Wikipedia pages. The whole XML file uses an XML namespace, so every tag name is qualified with that namespace. ElementTree writes a qualified name as the namespace URI enclosed in braces, followed by the tag name; the braced URI is stored in MW_NS.
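The lookups silently return nothing when the namespace in the file differs from the constant, which can happen if a dump declares a different export schema version. The following is a minimal sketch of reading the namespace from the root tag instead of hard-coding it. The SAMPLE string and the namespace_of() helper are made up for this illustration and are not part of the original script.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; a real file contains many <page> elements.
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    '<page><title>Praha</title>'
    '<revision><text>Capital city.</text></revision></page>'
    '</mediawiki>'
)

def namespace_of(element):
    """Return the '{uri}' prefix of an element's tag, or '' if there is none."""
    tag = element.tag
    return tag[:tag.index('}') + 1] if tag.startswith('{') else ''

root = ET.fromstring(SAMPLE)
ns = namespace_of(root)  # braced namespace URI taken from the document itself

# Qualified names are built by prepending the braced URI to the tag name.
titles = [page.find(ns + 'title').text for page in root.findall(ns + 'page')]
```

With this approach the same code keeps working when the schema version in the dump header changes.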

After obtaining the XML page element, we find the descendants containing the title and text of the latest revision and yield everything for further processing:

def parse_dump(xml_fn):
    # Parse the whole compressed dump into memory (this needs a lot of RAM).
    with bz2.BZ2File(xml_fn, 'r') as fr:
        tree = ET.parse(fr)
        root = tree.getroot()

    for page in root.findall('./{0}page'.format(MW_NS)):
        title = page.find('{0}title'.format(MW_NS)).text
        text = page.find('{0}revision/{0}text'.format(MW_NS))

        # Skip pages that have no revision text.
        if text is None or text.text is None:
            continue

        yield title, text.text
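For readers who cannot spare the memory for a full tree, here is a rough sketch of a streaming variant based on ElementTree's iterparse(). It is one possible approach and not necessarily the one from the follow-up article. The iter_pages() name and the in-memory sample are assumptions made for this example, and passing a file object to bz2.BZ2File requires Python 3.

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_pages(fileobj):
    """Yield (title, text) pairs while streaming through the XML."""
    context = ET.iterparse(fileobj, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    ns = root.tag[:root.tag.index('}') + 1] if root.tag.startswith('{') else ''
    for event, elem in context:
        if event == 'end' and elem.tag == ns + 'page':
            title = elem.findtext(ns + 'title')
            text = elem.findtext('{0}revision/{0}text'.format(ns))
            if text is not None:
                yield title, text
            root.clear()  # drop already processed pages from memory

# Small compressed sample standing in for a real dump file.
sample = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b'<page><title>A</title><revision><text>alpha</text></revision></page>'
    b'<page><title>B</title><revision><text>beta</text></revision></page>'
    b'</mediawiki>'
)
with bz2.BZ2File(io.BytesIO(bz2.compress(sample))) as fr:
    pages = list(iter_pages(fr))
```

Because each finished page is cleared from the root, memory use stays roughly proportional to the largest single page rather than to the whole dump.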

Removing MediaWiki Markup

The generator above returns the page title and the text of its latest revision, but the text still contains MediaWiki markup. We therefore use mwlib to remove that markup. We also skip markup that refers to images and tables, because we want clean text. Besides mwlib we use regular expressions for some preprocessing: removing MediaWiki control blocks and expanding simple links, in other words removing {{...}} blocks and replacing [[foo]] with [[foo|foo]].
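The two regular-expression steps can be tried on their own, without mwlib. The sketch below wraps them in a hypothetical preprocess() helper and applies them to made-up sample strings. It also shows a limitation: the non-greedy template pattern stops at the first }} and therefore leaves stray braces behind when templates are nested.

```python
import re

def preprocess(raw_text):
    # Remove {{...}} blocks (non-greedy, so nesting is not handled).
    raw_text = re.sub(r'(?s){{.*?}}', '', raw_text)
    # Expand simple links: [[foo]] becomes [[foo|foo]].
    raw_text = re.sub(r'(?s)\[\[([^|]+?)\]\]', r'[[\1|\1]]', raw_text)
    return raw_text

simple = preprocess("{{Infobox|name=Brno}}'''Brno''' is a city in [[Moravia]].")
piped = preprocess("[[Brno|the city]]")   # already piped, left unchanged
nested = preprocess("{{a|{{b}}}}x")       # the outer closing braces survive
```

Here simple becomes '''Brno''' is a city in [[Moravia|Moravia]]., piped stays as it was, and nested ends up as }}x.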

IGNORE = (nodes.ImageLink, nodes.Table, nodes.CategoryLink)

def get_text(p, buffer=None, depth=0):
    if buffer is None:
        buffer = []

    for ch in p.children:
        descend = True
        if isinstance(ch, IGNORE):
            continue
        elif isinstance(ch, nodes.Section):
            # Skip the first child, which holds the section heading.
            for sub in ch.children[1:]:
                get_text(sub, buffer, depth+1)
            descend = False
        elif isinstance(ch, nodes.Text):
            text = ch.asText()
            text = ' '.join(text.splitlines())
            buffer.append(text)

        if descend:
            get_text(ch, buffer, depth+1)

    return buffer

def wiki_to_text(raw_text):
    # Remove {{...}} control blocks.
    raw_text = re.sub(r'(?s){{.*?}}', '', raw_text)
    # Expand simple links: [[foo]] becomes [[foo|foo]].
    raw_text = re.sub(r'(?s)\[\[([^|]+?)\]\]', r'[[\1|\1]]', raw_text)

    parsed = parse_txt(raw_text)
    text = get_text(parsed)
    text = ''.join(text)
    return text
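The non-greedy pattern in wiki_to_text() ends at the first }}, so nested templates such as {{a|{{b}}}} leave stray closing braces in the text. One hedged alternative, which is not part of the original script, is to track brace depth by hand. The strip_templates() function below is a hypothetical helper for illustration; it still does not treat triple-brace parameters like {{{x}}} specially.

```python
def strip_templates(text):
    """Remove {{...}} blocks, including nested ones, by tracking brace depth."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == '{{':
            depth += 1           # entering a template
            i += 2
        elif pair == '}}' and depth > 0:
            depth -= 1           # leaving a template
            i += 2
        else:
            if depth == 0:       # keep only text outside of any template
                out.append(text[i])
            i += 1
    return ''.join(out)

cleaned = strip_templates("{{a|{{b}}}}x")
```

It could replace the first re.sub() call in wiki_to_text() if nested templates turn out to be common in the processed pages.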

The wiki_to_text() function first preprocesses the text and then processes the resulting MediaWiki markup with parse_txt() from mwlib. The resulting object tree is passed to get_text(), which recursively walks through the tree and returns a list of strings. It ignores objects listed in the IGNORE tuple: images, tables, and category links. Now we can assemble a small test script that glues everything together and prints the first ten articles to standard output:

idx = 0
for title, raw_text in parse_dump('/home/honzas/tmp/cswiki-latest-pages-articles.xml.bz2'):
    print(title)
    print(wiki_to_text(raw_text))
    print('')
    idx += 1
    if idx >= 10:
        break

Where to Use It

There are many possible uses. In artificial intelligence and computational linguistics, examples include training language models, transformations for latent semantic analysis, word statistics, word prediction, and more. Imagination is the limit.