Skip to content Skip to sidebar Skip to footer

Trying To Use Python To Parse A Docx Document In Xml Format To Print Words That Are In Bold

I have a word docx file that I would like to print the words that are in Bold looking through the document in xml format it seems the words I'm looking to print have the following

Solution 1:

If you want to find all the bold text you can use findall() with an xpath expression:

from lxml import etree

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

root = etree.parse('document.xml').getroot()
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
    print(e.text)

Instead of looking for w:r nodes with w:rsidRPr="00510F21" as an attribute (which I am not convinced denotes bolded text), look for run nodes (w:r) with w:b in the run properties tag (w:rPr), and then access the text tag (w:t) within. The w:b tag is the bold property as documented here.

The xpath expression can be simplified to './/w:b/../../w:t' although this is less rigorous and might result in false matches.

Solution 2:

Consider lxml's xpath() method. Recall .get() retrieves attributes and .find() retrieves nodes. And because the XML has namespaces in attributes, you will need to prefix the URI in the .get() call. Finally, use the .nsmap object to retrieve all namespaces at the root of the document.

from lxml import etree
doc = etree.parse("document.xml")
root = doc.getroot()

for wr_roots in doc.xpath('//w:r', namespaces=root.nsmap):
    if wr_roots.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr')\
       == '00510F21':
        print(wr_roots.find('w:t', namespaces=root.nsmap).text)

# Print this stuff

Post a Comment for "Trying To Use Python To Parse A Docx Document In Xml Format To Print Words That Are In Bold"