Python Regular Expressions - Extract Every Table Cell Content

Possible Duplicate: RegEx match open tags except XHTML self-contained tags

If I have a string that looks something like... '123234

Solution 1:

If that markup is part of a larger document, you should prefer a tool with an HTML parser. One such tool is BeautifulSoup. Here's one way to find what you need with it:

>>> markup = '''"<tr><td>123</td><td>234</td>...<td>697</td></tr>"'''
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(markup, "html.parser")  # name the parser explicitly to avoid a warning
>>> for i in soup.find_all('td'):
...     print(i.text)

Result:

123
234
697

Solution 2:

Don't do this with regular expressions. Use a proper HTML parser, and something like XPath to select the elements you want.

A lot of people like lxml. For this task, you will probably want to use its BeautifulSoup backend, or use BeautifulSoup directly, because the markup presumably does not come from a source known to generate well-formed, valid documents.
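
As a rough sketch of the parser-plus-XPath approach, assuming lxml is installed: the sample row below is modelled on the markup in the question, with illustrative cell values and a wrapping table tag added so the fragment is unambiguous.

```python
from lxml import html

# Sample markup (an assumption for illustration): one table row with three cells.
markup = "<table><tr><td>123</td><td>234</td><td>697</td></tr></table>"

# lxml.html uses a lenient HTML parser, so it copes with imperfect markup.
table = html.fromstring(markup)

# XPath: select the text node of every <td> below the parsed element.
cells = table.xpath(".//td/text()")
print(cells)
```

The XPath expression does the selection that the regular expression was meant to do, without caring about whitespace, attributes, or nesting inside the cells' ancestors.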

Solution 3:

lxml parses the markup into an element tree. Each element in the tree holds the information about one tag: its name, its attributes, and its text.

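from lxml import etree

# Parse a small XML document; <a> carries the attribute x='123'
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
elements = root.findall(".//a")  # every <a> element below the root
tag = elements[0].tag            # 'a'
attr = elements[0].attrib        # {'x': '123'}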
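
Applied to the table row from the question, the same calls might look like the sketch below. The sample row, its cell values, and the class attribute are made up for illustration, and the markup is assumed to be well-formed XML. The fallback import is also an assumption for convenience: the standard library's ElementTree offers the same calls used here.

```python
try:
    from lxml import etree  # the library this answer uses
except ImportError:
    # Fallback (assumption): the standard library exposes the same
    # XML(), findall(), .text and .attrib interface used in this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed sample row.
row = etree.XML("<tr><td class='n'>123</td><td>234</td><td>697</td></tr>")

# findall() returns one element per matching tag, in document order.
cells = row.findall(".//td")

texts = [td.text for td in cells]    # text content of each cell
first_attrs = dict(cells[0].attrib)  # attributes of the first cell
print(texts, first_attrs)
```

For markup that is not well-formed, etree.XML raises a parse error, which is where the HTML parsers from the other answers come in.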
