Trouble With Scraping
Tag And Datalist With Links In It
This is an example of the HTML I'm scraping with Python/Beautifulsoup:
Solution 2:
In [31]: for dd in soup.find_all('dd'):
    ...:     link = dd.a.get('href')
    ...:     link_text = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
    ...:     print(link, link_text, dd_text, sep='\n')
    ...:
http://www.eslcafe.com/jobs/china/index.cgi?read=45391
Teach English in Shenyang, China: Great salary, Support, and Structured program
Greenheart Travel -- Thursday, 9 February 2017, at 1:05 p.m.
dd_text is the last text node of the dd tag, so I use *_ to collect all the text nodes before it and bind only the final one to dd_text.
EDIT:
In [20]: for dd in soup.find_all('dd'):
    ...:     d = {}  # store data in a dict
    ...:     d['link'] = dd.a.get('href')
    ...:     d['link_text'] = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
    ...:     d['date_text'] = dd_text
    ...:     print(d)
    ...:
out:
{'date_text': 'EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 p.m.',
 'link': 'http://www.eslcafe.com/jobs/china/index.cgi?read=45426',
 'link_text': '❤ ❤ ❤ Teach English In China 12,000-20,000 RMB/month - Adults or Kids - Free Housing & Airfare - Free TEFL TESOL Certification - Where You Want - YOUR NEEDS ARE OUR TOP PRIORITY ❤ ❤ ❤'}
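The loop above assumes every dd tag contains a link; if one does not, dd.a is None and dd.a.get raises AttributeError. A possible self-contained variant that skips such entries and collects the dicts in a list is sketched below. The sample markup is an assumption: the HTML from the question is not shown here, so it only mimics the shape implied by the output, and parse_entries is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Assumed markup: a link followed by a "poster -- date" string inside each <dd>.
html = """
<dl>
  <dd><a href="http://www.eslcafe.com/jobs/china/index.cgi?read=45391">Teach English in Shenyang, China</a>
      <br>Greenheart Travel -- Thursday, 9 February 2017, at 1:05 p.m.</dd>
  <dd>No link in this entry</dd>
</dl>
"""


def parse_entries(markup):
    """Return one dict per <dd> that contains a link (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for dd in soup.find_all("dd"):
        if dd.a is None:
            # Skip entries without an <a> tag instead of raising AttributeError.
            continue
        # The last stripped string is the "poster -- date" text.
        *_, dd_text = dd.stripped_strings
        rows.append({
            "link": dd.a.get("href"),
            "link_text": dd.a.get_text(strip=True),
            "date_text": dd_text,
        })
    return rows


rows = parse_entries(html)
for row in rows:
    print(row)
```

Returning a list of dicts rather than printing inside the loop makes it easier to pass the results on, for example to csv.DictWriter.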