Extracting Tables From A Webpage Using BeautifulSoup 4
Solution 1:
First of all, if you want to extract all the tables on a page using BeautifulSoup, you can do it in the following way:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = input('Web-Address: ')
html = urlopen('http://' + url).read()
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
# extract all the tables in the HTML
tables = soup.find_all('table')
# get the class name for each table (None when a table has no class)
for table in tables:
    class_name = table.get('class')
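As a small illustration of what the class lookup returns, the sketch below parses an inline HTML snippet instead of a live page. The snippet and the class names `prices` and `wide` are made up for this example. BeautifulSoup treats `class` as a multi-valued attribute, so it comes back as a list of names, and `.get('class')` gives None for a table that has no class at all.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
SAMPLE_HTML = """
<table class="prices wide"><tr><td>1</td></tr></table>
<table><tr><td>2</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
tables = soup.find_all('table')

# class is multi-valued, so each entry is a list of names, or None
class_names = [table.get('class') for table in tables]

# class_ matches any table that carries the given class name
price_tables = soup.find_all('table', class_='prices')
```

Filtering with `class_` is handy when a page has many layout tables and only one of them holds the data you are after.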
Once you have all the tables in the page, you can do anything you want with their data by walking through the tr and td tags, in the following way:
for table in tables:
    tr_tags = table.find_all('tr')
Remember that the tr tags are the rows of the table. To get the data inside the td tags, you can use something like this:
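for table in tables:
    tr_tags = table.find_all('tr')
    for tr in tr_tags:
        td_tags = tr.find_all('td')
        for td in td_tags:
            # td.string is None when a cell has more than one child node,
            # so collect the text with get_text() instead
            text = td.get_text(strip=True)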
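The nested loops above can be wrapped into a helper that turns one table into a list of rows. This is only a sketch: the function name `table_to_rows` and the sample HTML are made up for illustration, and it assumes you also want header cells (th), which the td-only loop would skip.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apple</td><td><b>3</b></td></tr>
  <tr><td>Pear</td><td>5</td></tr>
</table>
"""

def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all('tr'):
        # include th so that header rows are not silently dropped
        cells = tr.find_all(['td', 'th'])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = table_to_rows(soup.find('table'))
```

Note that `find_all('tr')` searches recursively, so a table nested inside a cell would contribute its rows as well. For pages with nested tables you would probably need to handle that case separately.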
If you want to follow all the links on the page and then find the tables on each linked page, the code explained above still works for you: first retrieve all the URLs on a page, then move through them one by one. For example:
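from urllib.parse import urljoin
from urllib.request import urlopen
initial_url = 'URL'  # full address, including the http:// prefix
list_of_urls = [initial_url]
visited = set()
while len(list_of_urls) > 0:
    current_url = list_of_urls.pop()
    if current_url in visited:
        continue  # skip pages that were already processed
    visited.add(current_url)
    html = urlopen(current_url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # queue every link on the page, resolving relative hrefs first
    for anchor in soup.find_all('a', href=True):
        link = urljoin(current_url, anchor['href'])
        if link.startswith(('http://', 'https://')):
            list_of_urls.append(link)
    # here put the code explained above, for example
    tables = soup.find_all('table')
    for table in tables:
        class_name = table.get('class')
        # continue with the above code..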
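The href values collected from the anchors are often relative (`page2.html`, `/about`) or not web pages at all (`mailto:` links), so they usually need some cleaning before they can be fetched. A minimal sketch of that step using only the standard library follows. The helper name `resolve_links` and the example URLs are hypothetical, and restricting the crawl to the starting host is an assumption made here to keep it from wandering across the whole web.

```python
from urllib.parse import urljoin, urlparse

def resolve_links(page_url, hrefs):
    """Turn raw href values into absolute http(s) URLs on the same host."""
    host = urlparse(page_url).netloc
    resolved = []
    for href in hrefs:
        # urljoin resolves relative paths against the page they came from
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ('http', 'https') and parts.netloc == host:
            resolved.append(absolute)
    return resolved

links = resolve_links(
    'http://example.com/docs/index.html',
    ['page2.html', '/about', 'http://other.example.org/', 'mailto:me@example.com'],
)
```

In the crawl loop, the list passed as `hrefs` would be the `anchor['href']` values found on the current page.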
To insert the data into an SQLite database, I recommend reading the following tutorial: Python: A Simple Step-by-Step SQLite Tutorial
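For a rough idea of what the SQLite step can look like, here is a sketch with the standard `sqlite3` module. The table name `cells`, its two columns, and the sample rows are assumptions for illustration only. Your real schema depends on the tables you scrape.

```python
import sqlite3

# Rows as they might come out of the tr/td loops (made-up sample data).
rows = [('Apple', '3'), ('Pear', '5')]

conn = sqlite3.connect(':memory:')  # use a file path to persist the data
conn.execute('CREATE TABLE cells (name TEXT, qty TEXT)')
# ? placeholders let sqlite3 handle quoting of the scraped text
conn.executemany('INSERT INTO cells (name, qty) VALUES (?, ?)', rows)
conn.commit()

stored = conn.execute('SELECT name, qty FROM cells ORDER BY name').fetchall()
conn.close()
```

Using placeholders rather than building the SQL string by hand matters here, because scraped cell text can contain quotes or other characters that would otherwise break the statement.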
Solution 2:
You have probably already been here, but when I used BS (no pun intended) a while back, its doc page was where I started: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Personally, I found this official documentation could have been better, and the Beautiful Soup resources from the online community also seemed lacking at the time - this was about 3 or 4 years ago though.
I hope both have come farther since.
Another resource perhaps worth looking into is Mechanize: http://wwwsearch.sourceforge.net/mechanize/