
How To Get Missing HTML Data When Web Scraping With Python-requests

I am working on building a job board which involves scraping job data from company sites. I am currently trying to scrape Twilio at https://www.twilio.com/company/jobs. However, I

Solution 1:

Basic info for all the jobs at the different offices comes back dynamically from an API call, which you can find in the browser's network tab. If you extract the job ids from that response, you can then make separate requests for the detailed job info using those ids. For example:

import requests
from bs4 import BeautifulSoup as bs

listings = {}

with requests.Session() as s:
    r = s.get('https://api.greenhouse.io/v1/boards/twilio/offices').json()
    for office in r['offices']:
        for dept in office['departments']:  # you could filter by department here or later on
            if 'jobs' in dept:
                for job in dept['jobs']:
                    listings[job['id']] = job  # store the basic job info in a dict keyed by id
    for key in listings:
        r = s.get(f'https://boards.greenhouse.io/twilio/jobs/{key}')
        soup = bs(r.content, 'lxml')
        listings[key]['soup'] = soup  # store the detail-page soup alongside the basic info
        print(soup.select_one('.app-title').text)  # e.g. print the job title from the detail page
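
Once listings holds both the basic JSON fields and the detail-page soup, you can pull out whatever your job board needs. The sketch below is only illustrative: the 'title' and 'location' keys and the '#content' selector are assumptions about Greenhouse's usual JSON and page layout, so verify them against the actual responses before relying on them.

# Minimal follow-up sketch (assumed keys and selector are noted inline).
for job_id, job in listings.items():
    title = job.get('title', '')  # 'title' assumed present in the basic job JSON
    location = (job.get('location') or {}).get('name', '')  # 'location' assumed to be a dict with a 'name'
    desc_div = job['soup'].select_one('#content')  # '#content' is an assumed selector for the description
    description = desc_div.get_text(strip=True) if desc_div else ''
    print(job_id, title, location, len(description))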
