How to Save Time Browsing Paginated Websites

I have sometimes wasted a lot of time on pagination issues while scraping.
Learning how to browse through paginated websites from this tutorial by Michael Heydt, the author of Python Web Scraping Cookbook, proved very helpful!
Pagination breaks large sets of content into a number of pages. Typically, these pages have a previous/next page link for the user to click. These links can generally be found with XPath or other means and then followed to reach the next (or previous) page. In this tutorial, we will examine how to traverse across pages with Scrapy. As a hypothetical example, we will crawl the results of an automated internet search. The techniques apply directly to many commercial sites with search capabilities and can easily be modified to suit your scenario.
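To make the general idea concrete before diving into the recipe, here is a minimal sketch (not the book's code) of a plain scrapy.Spider that keeps following a next link, identified by its class, until no such link remains:
import scrapy

class NextLinkSpider(scrapy.Spider):
    # Hypothetical spider illustrating the generic "follow the next link" pattern
    name = "nextlink"
    start_urls = ["http://localhost:5001/pagination/page1.html"]

    def parse(self, response):
        # Extract the data items embedded in the page
        for text in response.xpath("//*/div[@class='data']/h1/text()").getall():
            yield {"item": text}
        # Follow the "next" link if one exists; stop when there is none
        next_href = response.xpath("//*/a[@class='next']/@href").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)
The recipe below achieves the same result with a CrawlSpider and link-extraction rules, which scale better when there are several kinds of links to follow.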

Getting ready

To demonstrate handling pagination, we will crawl a set of pages from a sample website. This website models five pages with previous and next links on each page, along with some embedded data within each page, which we will extract.
The first page of the set is at http://localhost:5001/pagination/page1.html. Open this page in a browser and inspect the Next button with the developer tools to see how the link is marked up.

Two parts of the page are of interest to us: the next page link and the data in the page. It is fairly common for a next page link to carry a class that identifies it as the link to the next page, and we can use this information to find it. In this case, we can locate it with the following XPath:
//*/a[@class='next']
The data on the page can be identified by a <div> tag with a class="data" attribute. The five pages in our sample website have only one data item each, but in a real crawl of search results we would typically pull multiple items from each page.
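Before wiring these selectors into a spider, it can help to sanity-check them. The following is a small sketch (assuming the sample site is running locally and the requests package is available) that fetches the first page and applies both XPath expressions with Scrapy's Selector:
import requests
from scrapy.selector import Selector

# Fetch the first page of the sample site and wrap it in a Selector
html = requests.get("http://localhost:5001/pagination/page1.html").text
sel = Selector(text=html)

# The href of the next-page link, identified by its class
print("Next page:", sel.xpath("//*/a[@class='next']/@href").get())

# The data item(s) embedded in the page
print("Data items:", sel.xpath("//*/div[@class='data']/h1/text()").getall())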
Now let's go and run a scraper for these pages.

How to do it

There is a script named 06/08_scrapy_pagination.py. Running this script with Python produces a large amount of output from Scrapy, most of which is the standard Scrapy debugging output. However, the output also shows the data items we extracted from all five pages:
Page 1 Data
Page 2 Data
Page 3 Data
Page 4 Data
Page 5 Data

How it works

The code begins with the definition of CrawlSpider and the start URL:
class PaginatedSearchResultsSpider(CrawlSpider):
    name = "paginationscraper"
    start_urls = [
        "http://localhost:5001/pagination/page1.html"
    ]
Then, we define the rules field, which tells Scrapy how to parse each page to look for links. This code uses the XPath discussed above to find the Next link in the page. Scrapy applies this rule on every page to find the next page and queues a request to process it after the current page. For each page that is found, the callback parameter tells Scrapy which method to call to process it; in this case, parse_result_page:
rules = (
    # Extract links for next pages
    Rule(LinkExtractor(allow=(),
                       restrict_xpaths=("//*/a[@class='next']")),
         callback='parse_result_page', follow=True),
)
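For reference, the spider and the runner code in this recipe rely on a handful of Scrapy imports. A minimal sketch of the top of the script (the book's actual import block may differ slightly) would be:
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule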
A single list variable named all_items is declared to hold all the items we find:
all_items = []
Then, the parse_start_url method is defined. Scrapy will call this to parse the initial URL in the crawl. The function simply defers the processing to parse_result_page:
def parse_start_url(self, response):
    return self.parse_result_page(response)
The parse_result_page method then uses XPath to find the text inside the <h1> tag within the <div class="data"> tag. It then appends this text to the all_items list:
def parse_result_page(self, response):
    data_items = response.xpath("//*/div[@class='data']/h1/text()")
    for data_item in data_items:
        self.all_items.append(data_item.root)
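Here, .root returns the underlying value of each matched selector, which for a text node is the plain string. In recent Scrapy versions, a roughly equivalent and arguably more idiomatic sketch would use the extraction API directly:
def parse_result_page(self, response):
    # getall() returns the matched text nodes as plain strings
    for text in response.xpath("//*/div[@class='data']/h1/text()").getall():
        self.all_items.append(text)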
Once the crawl is completed, the closed() method is called, which writes out the content of the all_items field:
def closed(self, reason):
    for i in self.all_items:
        print(i)
The crawler is run as a Python script using the following code:
if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_PAGECOUNT': 10
    })
    process.crawl(PaginatedSearchResultsSpider)
    process.start()
Please note that the CLOSESPIDER_PAGECOUNT setting is set to 10. This exceeds the number of pages on this site, but in most cases a search result will span thousands of pages. It is good practice to stop after an appropriate number of pages, as the relevance of items in a search result drops dramatically after the first few pages.
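If you would rather keep this limit with the spider instead of the runner script, Scrapy also lets you set it per spider through the custom_settings class attribute; a minimal sketch applied to the spider defined above:
class PaginatedSearchResultsSpider(CrawlSpider):
    name = "paginationscraper"
    # Stop the crawl after this many pages, even if more next links exist
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 10
    }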

For more, Python Web Scraping Cookbook is a practical resource. With over 90 proven Python scraping recipes, the book is helpful for Python programmers, web administrators, security professionals, and experts performing web analytics.
