python - Scrapy: how to follow multiple links on a page using regex
I have a scraper that collects information perfectly, but when I try to implement rules to crawl the "next" page I get stuck. I'm using Scrapy 0.22 (I can't upgrade at this time).
    import re
    import datetime
    import dateutil
    import urllib2

    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule

    from crawlers.spiders import BaseCrawler


    class RappSpider(BaseCrawler):
        name = "rapp"
        base_url = "www.example.com"
        start_urls = [
            # "http://www.example.com/news-perspective",
            # "http://www.example.com/news-perspective?f[0]=field_related_topics%3a31366",
            "http://www.example/news-perspective?key=&page=%d"
        ]

        # rules = [
        #     Rule(SgmlLinkExtractor(allow=r'?key=&page=[0-9]'), callback='get_article_links', follow=True)
        # ]

        title_xpath_selector = "//div[@id='inset-content']//h1/text()"
        text_xpath_selector = "//div[@class='field-item even']/p/text()"
        datetime_xpath_selector = "//div[@class='field-items']/div/span/text()"

        def get_article_links(self, response, *args, **kwargs):
            html = Selector(response)
            link_extractor = SgmlLinkExtractor(allow=('http://www.example.com/news-perspective/\d{4}/\d{2}\/*\s*$',))
            is_relative_path = False
            yield [link.url for link in link_extractor.extract_links(response)], is_relative_path

The scraper works for a start URL like http://www.example/news-perspective, which lists a number of articles on the page. The scraper then follows the links defined by get_article_links and gets the relevant information. However, I'd like to be able to go on to the next page (the other pages have the same format), with the URL being:
http://www.example/news-perspective?key=&page=#
How can I set this up with my existing code? Do I need two separate rules? Or do I need to alter start_requests?
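For reference, here are two details in the code above that can be checked without Scrapy, using only the standard library. This is a minimal sketch, not a confirmed fix. The pattern in the commented-out rule starts with a bare `?`, which Python's `re` module rejects as a quantifier with nothing to repeat, and the `%d` in the start URL is never filled in. The escaped pattern, the helper names (`compiles`, `listing_urls`, `is_listing_page`) and the page range are assumptions made for illustration; the site's real page numbering is not stated in the question.

```python
import re

# Pattern from the commented-out rule in the question. The leading "?" is a
# regex quantifier with nothing in front of it.
ORIGINAL_PATTERN = r'?key=&page=[0-9]'

# Escaped variant (an assumption about the intent): match a literal "?" and
# allow page numbers with more than one digit.
ESCAPED_PATTERN = r'\?key=&page=\d+'

# Listing URL template exactly as it appears in start_urls in the question.
PAGE_URL_TEMPLATE = "http://www.example/news-perspective?key=&page=%d"


def compiles(pattern):
    """Return True if the regex compiles, False if the re module rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def listing_urls(first_page, last_page):
    """Hypothetical helper: expand the %d placeholder into concrete URLs."""
    return [PAGE_URL_TEMPLATE % page for page in range(first_page, last_page + 1)]


def is_listing_page(url):
    """Check whether a URL looks like a paginated listing page."""
    return re.search(ESCAPED_PATTERN, url) is not None
```

With something like this, the concrete URLs from `listing_urls` could be placed in `start_urls`, or the escaped pattern could be passed as the `allow` argument of a link extractor. Whether `rules` are honoured at all depends on how `BaseCrawler` is built, which the question does not show.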