Scraping A Table From A Page Using Beautifulsoup, Table Is Not Found

I've been trying to scrape the table from here but it seems to me that BeautifulSoup doesn't find any table. I wrote: import requests import pandas as pd from bs4 import BeautifulS

Solution 1:

You are parsing HTML but using the XML parser. Use soup=BeautifulSoup(data,"html.parser") instead. The data you need is inside a script tag; there is actually no table tag in the page, so you need to search the text inside the script elements. N.B.: BeautifulSoup expects the parser name "html.parser" on both Python 2.x and 3.x; HTMLParser is only the name of the underlying standard-library module in Python 2.

Here is the code.

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.payscale.com/college-salary-report/bachelors?page=65" 
r=requests.get(url)
data=r.text

soup=BeautifulSoup(data,"html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv","w",newline="")
writer = csv.writer(file_name)
list_to_write = []

list_to_write.append(["Rank","School Name","School Type","Early Career Median Pay","Mid-Career Median Pay","% High Job Meaning","% STEM"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    if(len(text) > 10000):
        while(start > -1):
            start = text.find('"School Name":"',start)
            if(start == -1):
                break
            start += len('"School Name":"')
            end = text.find('"',start)
            school_name = text[start:end]

            start = text.find('"Early Career Median Pay":"',start)
            start += len('"Early Career Median Pay":"')
            end = text.find('"',start)
            early_pay = text[start:end]

            start = text.find('"Mid-Career Median Pay":"',start)
            start += len('"Mid-Career Median Pay":"')
            end = text.find('"',start)
            mid_pay = text[start:end]

            start = text.find('"Rank":"',start)
            start += len('"Rank":"')
            end = text.find('"',start)
            rank = text[start:end]

            start = text.find('"% High Job Meaning":"',start)
            start += len('"% High Job Meaning":"')
            end = text.find('"',start)
            high_job = text[start:end]

            start = text.find('"School Type":"',start)
            start += len('"School Type":"')
            end = text.find('"',start)
            school_type = text[start:end]

            start = text.find('"% STEM":"',start)
            start += len('"% STEM":"')
            end = text.find('"',start)
            stem = text[start:end]

            list_to_write.append([rank,school_name,school_type,early_pay,mid_pay,high_job,stem])
writer.writerows(list_to_write)
file_name.close()

This will write the table you need to table.csv. Don't forget to close the file when you are done.
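The repeated find-and-slice steps above can be folded into one helper. This is only a minimal sketch, not the answer's original code: extract_field is a hypothetical name, and the sample string is made up to imitate the "key":"value" layout that the answer searches for.

```python
def extract_field(text, key, start=0):
    """Return (value, next_position) for the first '"key":"value"' pair
    found at or after start, or (None, -1) when the key is missing."""
    marker = '"%s":"' % key
    pos = text.find(marker, start)
    if pos == -1:
        return None, -1
    pos += len(marker)
    end = text.find('"', pos)
    if end == -1:
        return None, -1
    return text[pos:end], end


# Made-up sample imitating the layout of the embedded data.
sample = (
    '{"School Name":"Example College","Rank":"1"},'
    '{"School Name":"Sample University","Rank":"2"}'
)

rows = []
position = 0
while True:
    name, position = extract_field(sample, "School Name", position)
    if name is None:
        break  # no more records
    rank, position = extract_field(sample, "Rank", position)
    if rank is None:
        break  # incomplete record, stop rather than write bad data
    rows.append([rank, name])

print(rows)
```

Like the original loop, this assumes the keys appear in a fixed order within each record and that values contain no escaped quotes.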

Solution 2:

This alone won't find the table, because the table is not in r.text, but you are asking BeautifulSoup to use the xml parser instead of html.parser, so I would recommend changing that line to:

soup=BeautifulSoup(data,'html.parser')

One of the issues you will run into with web scraping is "client-rendered" websites, as opposed to server-rendered ones. Basically, this means that the page you get from a plain HTTP request, through the requests module or through curl for example, does not have the same content that would be rendered in a web browser. Some of the common frameworks for this are React and Angular. If you examine the source of the page you want to scrape, you will see data-reactid attributes on several of its HTML elements. A common tell for Angular pages is a similar set of element attributes with the prefix ng, e.g. ng-if or ng-bind. You can see the page's source in Chrome or Firefox through their respective dev tools, which can be launched with the keyboard shortcut Ctrl+Shift+I in either browser. It's worth noting that not all React and Angular pages are client-rendered only.
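The attribute hints described above can be checked programmatically. The following is a rough sketch using only the standard library; FrameworkHintParser and framework_hints are hypothetical names, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class FrameworkHintParser(HTMLParser):
    """Collect attribute names that hint at React or Angular markup."""

    def __init__(self):
        super().__init__()
        self.hints = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        for name, _value in attrs:
            if name.startswith("data-react"):
                self.hints.add("react")
            elif name.startswith("ng-"):
                self.hints.add("angular")


def framework_hints(html):
    """Return a sorted list of framework hints found in the HTML text."""
    parser = FrameworkHintParser()
    parser.feed(html)
    return sorted(parser.hints)


# Made-up markup containing both kinds of attribute.
sample_html = '<div data-reactid="1"><span ng-if="show">hi</span></div><p class="x">text</p>'
print(framework_hints(sample_html))
```

An empty result does not prove that a page is server-rendered; it only means these particular attributes were not present in the HTML that was fetched.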

In order to get this sort of content, you would need to use a headless browser tool like Selenium. There are many resources on web scraping with Selenium and Python.

Solution 3:

The data is stored in a JavaScript variable, so you should find the script text and then use a regex to extract it. The extracted text is a JSON list containing 900+ school dicts; use the json module to load it into a Python list object.

import requests, bs4, re, json
from pprint import pprint

url = "http://www.payscale.com/college-salary-report/bachelors?page=65"
r = requests.get(url)
data = r.text
soup = bs4.BeautifulSoup(data, 'lxml')
var = soup.find(text=re.compile('collegeSalaryReportData'))
table_text = re.search(r'collegeSalaryReportData = (\[.+\]);\n    var', var, re.DOTALL).group(1)
table_data = json.loads(table_text)
pprint(table_data)
print('The number of school', len(table_data))

out:

 {'% Female': '0.57',
  '% High Job Meaning': 'N/A',
  '% Male': '0.43',
  '% Pell': 'N/A',
  '% STEM': '0.1',
  '% who Recommend School': 'N/A',
  'Division 1 Basketball Classifications': 'Not Division 1 Basketball',
  'Division 1 Football Classifications': 'Not Division 1 Football',
  'Early Career Median Pay': '36200',
  'IPEDS ID': '199643',
  'ImageUrl': '/content/school_logos/Shaw University_50px.png',
  'Mid-Career Median Pay': '45600',
  'Rank': '963',
  'School Name': 'Shaw University',
  'School Sector': 'Private not-for-profit',
  'School Type': 'Private School, Religious',
  'State': 'North Carolina',
  'Undergraduate Enrollment': '1664',
  'Url': '/research/US/School=Shaw_University/Salary',
  'Zip Code': '27601'}]
The number of school 963
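Once the data is a list of dicts, writing the table to CSV takes only a few lines with csv.DictWriter. This is a sketch under assumptions: the records below are made up to mirror the shape of table_data, write_table is a hypothetical helper, and the column list is taken from the header row used in Solution 1.

```python
import csv
import io

# Made-up records shaped like the dicts in table_data (values are illustrative only).
table_data = [
    {"Rank": "1", "School Name": "Example College", "School Type": "Private School",
     "Early Career Median Pay": "50000", "Mid-Career Median Pay": "90000",
     "% High Job Meaning": "0.5", "% STEM": "0.2", "State": "Nowhere"},
    {"Rank": "2", "School Name": "Sample University", "School Type": "State School",
     "Early Career Median Pay": "48000", "Mid-Career Median Pay": "85000",
     "% High Job Meaning": "N/A", "% STEM": "0.1", "State": "Nowhere"},
]

columns = ["Rank", "School Name", "School Type", "Early Career Median Pay",
           "Mid-Career Median Pay", "% High Job Meaning", "% STEM"]


def write_table(records, out_file):
    """Write the selected columns of each record to out_file as CSV."""
    # extrasaction="ignore" drops keys that are not listed in columns.
    writer = csv.DictWriter(out_file, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO()
write_table(table_data, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("table.csv", "w", newline="") inside a with block, which also closes the file automatically.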