python - Scrapy - Output to Multiple JSON files
I am pretty new to Scrapy. I am looking to use it to crawl an entire website for links, and to output the items into multiple JSON files so I can then upload them to Amazon CloudSearch for indexing. Is it possible to split the items into multiple files instead of having one giant file at the end? From what I've read, item exporters can only output one file per spider, and I am only using one CrawlSpider for this task. It would be nice if I could set a limit on the number of items included in each file, like 500 or 1000.
Here is the code I have set up so far (based on the dmoz.org example used in the tutorial):
dmoz_spider.py
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
items.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Thanks for the help.
I don't think the built-in feed exporters support writing multiple files.
One option is to export a single file in the JSON Lines format: basically, one JSON object per line, which makes it convenient to pipe and split. For example:
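A minimal sketch of how you might enable that export, assuming an older Scrapy version where the FEED_URI/FEED_FORMAT settings are used (newer releases use the FEEDS dictionary instead); the file name items.jl is just a placeholder:

# settings.py (in your tutorial project)
FEED_URI = 'items.jl'        # output file; "items.jl" is an example name
FEED_FORMAT = 'jsonlines'    # one JSON object per line

You should get the same result from the command line with scrapy crawl dmoz -o items.jl, since Scrapy infers the JSON Lines format from the .jl extension.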
Then, separately, after the crawl is done, you can read that file in chunks of the desired size and write out separate JSON files, as in the sketch below.
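A rough sketch of that post-processing step, assuming the feed was written to items.jl and you want at most 500 items per output file (both the file names and the chunk size here are placeholder choices, not anything Scrapy requires):

# split_items.py - split a JSON Lines feed into multiple JSON array files.
import json

CHUNK_SIZE = 500  # assumed limit; adjust to your CloudSearch batch size

def write_chunk(items, index):
    # Each output file is a regular JSON array of up to CHUNK_SIZE items.
    with open('items_%03d.json' % index, 'w') as outfile:
        json.dump(items, outfile)

def split_jsonlines(input_path='items.jl', chunk_size=CHUNK_SIZE):
    chunk, file_index = [], 0
    with open(input_path) as infile:
        for line in infile:
            line = line.strip()
            if not line:
                continue
            chunk.append(json.loads(line))
            if len(chunk) == chunk_size:
                write_chunk(chunk, file_index)
                chunk, file_index = [], file_index + 1
    if chunk:  # write any leftover items
        write_chunk(chunk, file_index)

if __name__ == '__main__':
    split_jsonlines()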
Then you can upload them to Amazon CloudSearch for indexing.
Note that there is a direct Amazon S3 exporter (not sure if it helps, just FYI).