How To Get Missing HTML Data When Web Scraping With Python-requests
I am working on building a job board, which involves scraping job data from company sites. I am currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, the HTML I get back with requests does not contain the actual job listings.
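A stripped-down version of what I am doing looks roughly like this (a simplified sketch, not the exact project code):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.twilio.com/company/jobs')
soup = BeautifulSoup(r.content, 'lxml')

# The page shell comes back fine, but none of the individual job postings
# show up in the returned HTML.
print(r.status_code)
print(len(soup.find_all('a')))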
Solution 1:
Basic info for all the jobs at the different offices comes back dynamically from an API call you can find in the network tab. If you extract the job ids from that response, you can then make separate requests for the detailed job info using those ids. Example below:
import requests
from bs4 import BeautifulSoup as bs

listings = {}

with requests.Session() as s:
    # Offices endpoint returns all offices, their departments and jobs as JSON
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']: # you could perform some filtering here or later on
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job # store basic job info in dict keyed by job id
    for key in listings.keys():
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup # store soup from detail page alongside the basic info
        print(soup.select_one('.app-title').text) # print example, e.g. the job title from the page
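If you only want certain teams, you could do the filtering mentioned above on the department info before storing anything. A minimal sketch, self-contained for clarity; the 'name' field and the 'Engineering' value are assumptions for illustration, so check the JSON in the network tab for the exact structure:

import requests

with requests.Session() as s:
    data = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()

filtered = {}
for office in data['offices']:
    for dept in office['departments']:
        # keep only jobs from departments whose name matches the filter
        if dept.get('name') == 'Engineering' and 'jobs' in dept:
            for job in dept['jobs']:
                filtered[job['id']] = job

print(len(filtered), 'jobs kept after filtering')

You could then loop over filtered exactly as in the main example to fetch and parse the detail pages for just those jobs.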