
Retrieving Essential Data From A Webpage Using Python

The following is part of a webpage I downloaded with urlretrieve (urllib). I want to write only this data from the webpage given below into another text file as: ENGINEERING MATHEMA

Solution 1:

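import BeautifulSoup  # BeautifulSoup 3 (Python 2)

def main():
    infname  = 'htmltable.html'
    outfname = 'courses.txt'

    # Read the saved page from disk.
    with open(infname) as inf:
        html = inf.read()

    # Parse it and locate the table that holds the course rows.
    doc   = BeautifulSoup.BeautifulSoup(html)
    table = doc.find('table', {'id': 'content'})

    with open(outfname, 'w') as outf:
        for row in table.findAll('tr'):
            cells = [cell.getText().strip() for cell in row.findAll('td')]
            if len(cells) != 6:
                continue  # skip header or malformed rows
            # The first cell is the course code, which is not written out.
            code, name, a, b, c, d = cells
            outf.write("{name}, {a}, {b}, {c}, {d}\n".format(name=name, a=a, b=b, c=c, d=d))

if __name__ == "__main__":
    main()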

This works quite nicely, assuming the saved page starts like this:

<html><head><title>Data Table</title></head><body>
<table id='content'>
<tr align=left bgcolor='#FFFFFF'>       <td>EIT402    </td>
    <td>ENGINEERING MATHEMATICS-IV</td>
        <td align=center>4</td>
        <td align=center>36</td>
        <td align=center>40</td>
        <td align=center>F</td>
    </tr>

The output file, courses.txt, then contains:

ENGINEERING MATHEMATICS-IV, 4, 36, 40, F
ENVIRONMENTAL STUDIES, 47, 36, 83, P
SYSTEM PROGRAMMING, 40, 36, 76, P
MICROPROCESSOR BASED DESIGN, 3, 35, 38, F
PROGRAMMING PARADIGMS, 42, 36, 78, P
COMMUNICATION SYSTEMS, 9, 35, 44, F
DATA STRUCTURE LAB, 10, 35, 45, F
PROGRAMMING  ENVIRONMENTS  LAB, 20, 25, 45, F
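
The solution above targets Python 2 and BeautifulSoup 3. For Python 3 without third-party packages, the same idea can be sketched with the standard library's html.parser module. This is only a rough equivalent under some assumptions: the names TableRowParser and extract_courses are made up for illustration, the sample string is the table fragment shown above with closing tags added, and, unlike the original, the sketch reads rows from every table on the page rather than only the one with id 'content'.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row.

    Hypothetical helper for illustration; it does not filter by table id.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text chunks of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_courses(html):
    """Return one 'name, a, b, c, d' line per six-cell table row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    # Drop the first cell (the course code) and skip rows of any other width.
    return [", ".join(row[1:]) for row in parser.rows if len(row) == 6]


# Sample input: the fragment from the answer, closed off so it stands alone.
sample = """<table id='content'>
<tr align=left bgcolor='#FFFFFF'>       <td>EIT402    </td>
    <td>ENGINEERING MATHEMATICS-IV</td>
        <td align=center>4</td>
        <td align=center>36</td>
        <td align=center>40</td>
        <td align=center>F</td>
    </tr>
</table>"""

print("\n".join(extract_courses(sample)))
```

To write a file instead of printing, the returned lines can be joined with newlines and passed to a file object's write method, as in the original main().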
