Trying To Use Python To Parse A Docx Document In Xml Format To Print Words That Are In Bold
Solution 1:
If you want to find all the bold text you can use findall()
with an xpath
expression:
from lxml import etree
namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
root = etree.parse('document.xml').getroot()
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
print(e.text)
Instead of looking for w:r
nodes with w:rsidRPr="00510F21"
as an attribute (which I am not convinced denotes bolded text), look for run nodes (w:r
) with w:b
in the run properties tag (w:rPr
), and then access the text tag (w:t
) within. The w:b
tag is the bold property as documented here.
The xpath expression can be simplified to './/w:b/../../w:t'
although this is less rigorous and might result in false matches.
Solution 2:
Consider lxml's xpath()
method. Recall .get()
retrieves attributes and .find()
retrieves nodes. And because the XML has namespaces in attributes, you will need to prefix the URI in the .get()
call. Finally, use the .nsmap
object to retrieve all namespaces at the root of the document.
from lxml import etree
doc = etree.parse("document.xml")
root = doc.getroot()
for wr_roots in doc.xpath('//w:r', namespaces=root.nsmap):
if wr_roots.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr')\
== '00510F21':
print(wr_roots.find('w:t', namespaces=root.nsmap).text)
# Print this stuff
Post a Comment for "Trying To Use Python To Parse A Docx Document In Xml Format To Print Words That Are In Bold"