
Scraping Part Of A Wikipedia Infobox

I'm using Python 2.7, requests and BeautifulSoup to scrape approximately 50 Wikipedia pages. I've created a column in my dataframe that has partial URLs that relate to the name…

Solution 1:

Rather than trying to parse the HTML output, parse the raw MediaWiki source for the page; the first line that starts with | Length contains the information you are looking for:

import requests

# xyz is the partial URL (the page title), e.g. 'No_One_Knows'
url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text
for line in page.splitlines():
    if line.startswith('| Length'):
        # take everything after the '=' and strip surrounding whitespace
        length = line.partition('=')[-1].strip()
        break

Demo:

>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...         length = line.partition('=')[-1].strip()
...         break
... 
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
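
Applied to the ~50 pages in the question, the same logic could be wrapped in a function and mapped over the dataframe column. A minimal sketch, assuming pandas and a hypothetical column named partial_url (adjust the name to your dataframe):

import pandas as pd
import requests

def get_length(partial_url):
    # Fetch the raw MediaWiki source and return the infobox Length value.
    resp = requests.get('http://en.wikipedia.org/wiki/' + partial_url,
                        params={'action': 'raw'})
    for line in resp.text.splitlines():
        if line.startswith('| Length'):
            return line.partition('=')[-1].strip()
    return None  # no | Length line found in the page source

# 'partial_url' is a hypothetical column name used for illustration.
df = pd.DataFrame({'partial_url': ['No_One_Knows']})
df['length'] = df['partial_url'].apply(get_length)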

You can further process the raw length string to extract the richer data (the Radio edit time vs. the Album version time) as required.
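
For example, a regex could pull each time and its label apart. This is a sketch that assumes the markup keeps the time-followed-by-<small>-label pattern seen above; it is not guaranteed to hold for every infobox:

import re

raw = '4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>'
# Pair each m:ss time with the label in the <small>(...)</small> that follows it.
for time, label in re.findall(r'(\d+:\d+)\s*<small>\(([^)]+)\)</small>', raw):
    print '%s: %s' % (label, time)
# Radio edit: 4:13
# Album version: 4:38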

