
Scraping From Web Page And Reformatting To A Calendar File

I'm trying to scrape this site: http://stats.swehockey.se/ScheduleAndResults/Schedule/3940 And I've gotten as far (thanks to alecxe) as retrieving the date and teams. from scrapy.i

Solution 1:

I'm guessing that home games are the ones where the team you're looking for is listed first (before the dash).

You can do this in XPath or in Python. If you want to do it in XPath, select only the rows whose home-team part contains the team name.

//table[@class="tblContent"]/tr[
    contains(substring-before(.//td[3]/text(), "-"), "AIK")
  or
    contains(substring-before(.//td[3]/text(), "-"), "Djurgårdens IF")
]

You can safely collapse the expression onto one line; I only added the line breaks and indentation for readability. Just keep the spaces around "or" and inside the quoted team names.

In Python you should be able to do much the same, perhaps even more concisely with regular expressions.
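As a rough sketch of the regular-expression route, the snippet below splits a fixture string at the first dash and checks the home side. It is written for Python 3, the "Home - Away" layout is assumed from the XPath above, and the names HOME_TEAMS, FIXTURE_RE, split_fixture and is_home_game are made up for illustration.

```python
import re

# Hypothetical names for illustration; none of them come from the original code.
HOME_TEAMS = {"AIK", "Djurgårdens IF"}

# Everything before the first dash is treated as the home team and the rest as
# the away team. A home team with a dash in its own name would be split wrongly.
FIXTURE_RE = re.compile(r"^\s*(?P<home>.+?)\s*-\s*(?P<away>.+?)\s*$")


def split_fixture(fixture):
    """Return (home, away) with surrounding whitespace removed, or None."""
    match = FIXTURE_RE.match(fixture)
    if match is None:
        return None
    return match.group("home"), match.group("away")


def is_home_game(fixture, home_teams=HOME_TEAMS):
    """True when the team named before the dash is one of home_teams."""
    teams = split_fixture(fixture)
    return teams is not None and teams[0] in home_teams
```

Unlike contains() in the XPath version, this compares the whole home-team name, so a team whose name merely includes "AIK" would not match.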


Solution 2:

A few points to note:

  1. string is the name of a standard-library module (and str is the built-in type), so it's generally good practice to avoid using either for your own variables
  2. Removing whitespace was indeed the way to clean up home_team enough to do a straight comparison with the required "AIK". I used .strip() on home_team and away_team as it's a little cleaner than .replace(" ", ""), but that's a personal thing
  3. I also added a ":" between the home and away teams in the print lines to distinguish between them more clearly when I was testing, so feel free to get rid of that change
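To make point 2 concrete, here is a tiny comparison of the two clean-up approaches on a made-up input string (the padding around the name is an assumption about what the scraped cell looks like). It is written for Python 3.

```python
# Made-up sample value: a two-word team name padded with spaces.
raw_team = " Djurgårdens IF "

stripped = raw_team.strip()            # removes only leading/trailing whitespace
collapsed = raw_team.replace(" ", "")  # removes every space, including the inner one

matches_after_strip = stripped == "Djurgårdens IF"
matches_after_replace = collapsed == "Djurgårdens IF"
```

For a one-word name like "AIK" both give the same result, but for "Djurgårdens IF" only the strip() version still equals the name you'd compare against.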

Have a check and let me know if there are any other issues. :)

    def parse(self, response):
        # Relies on the imports from the question (HtmlXPathSelector, SchemaItem).
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')
        home_teams = (u"AIK", u"Djurgårdens IF")

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            for fixture in item['teams']:
                teams = fixture.split('-')  # "Home - Away" -> ["Home ", " Away"]
                if len(teams) < 2:
                    continue  # no dash in this cell, so it isn't a fixture
                home_team = teams[0].strip()
                away_team = teams[1].strip()

                # Only print fixtures where one of the wanted teams is at home
                if home_team in home_teams:
                    for fixDate in item['date']:
                        # The slices assume a "YYYY-MM-DD HH:MM" string
                        year = fixDate[0:4]
                        month = fixDate[5:7]
                        day = fixDate[8:10]
                        hour = fixDate[11:13]
                        minute = fixDate[14:16]
                        print year, month, day, hour, minute, home_team, ":", away_team
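If the end goal is a calendar file, the printed fields could be turned into iCalendar text along these lines. This is only a sketch under assumptions: it is written for Python 3 with the standard library only, it assumes the "YYYY-MM-DD HH:MM" layout implied by the slices above, and parse_fixture_date, to_vevent, to_calendar, the PRODID and the example.invalid UID domain are all made-up placeholders. It does not set an end time or escape commas and semicolons in the summary.

```python
from datetime import datetime, timezone


def parse_fixture_date(fix_date):
    """Parse the assumed 'YYYY-MM-DD HH:MM' prefix of a scraped date string."""
    return datetime.strptime(fix_date.strip()[:16], "%Y-%m-%d %H:%M")


def to_vevent(start, home_team, away_team, stamp=None):
    """Build one VEVENT block; start is written as a floating local time."""
    stamp = stamp or datetime.now(timezone.utc)
    uid = "{0:%Y%m%dT%H%M}-{1}-{2}@example.invalid".format(
        start, home_team.replace(" ", ""), away_team.replace(" ", ""))
    lines = [
        "BEGIN:VEVENT",
        "UID:" + uid,                                # placeholder UID scheme
        "DTSTAMP:{0:%Y%m%dT%H%M%SZ}".format(stamp),  # creation time in UTC
        "DTSTART:{0:%Y%m%dT%H%M%S}".format(start),
        "SUMMARY:{0} - {1}".format(home_team, away_team),
        "END:VEVENT",
    ]
    return "\r\n".join(lines)


def to_calendar(vevents):
    """Wrap VEVENT blocks in a VCALENDAR; iCalendar lines end with CRLF."""
    header = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//fixtures//EN"]
    return "\r\n".join(header + list(vevents) + ["END:VCALENDAR"]) + "\r\n"
```

In the spider you would collect to_vevent(parse_fixture_date(fixDate), home_team, away_team) for each home game instead of printing, then write to_calendar(...) to an .ics file once the crawl finishes.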
