python - Scrapy - Output to Multiple JSON files


I am pretty new to Scrapy. I am looking into using it to crawl an entire website for links, and to output the items into multiple JSON files so I can upload them to Amazon Cloud Search for indexing. Is it possible to split the items into multiple files instead of having one giant file at the end? From what I've read, the item exporters can only output one file per spider, and I am only using one CrawlSpider for this task. It would be nice if I could set a limit on the number of items included in each file, like 500 or 1000.

Here is the code I have set up so far (based on the dmoz.org site used in the tutorial):

dmoz_spider.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

items.py

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Thanks for the help.

I don't think the built-in feed exporters support writing to multiple files.

One option is to export a single file in JSON Lines format, which is basically one JSON object per line, and is therefore convenient to pipe and split.
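For example, you can run the crawl as scrapy crawl dmoz -o items.jl (the .jl extension selects the JSON Lines exporter), or set the feed options in settings.py. A minimal sketch, assuming the Scrapy version in use still supports the FEED_URI/FEED_FORMAT settings:

# settings.py -- write all scraped items to one JSON Lines file
FEED_URI = 'items.jl'
FEED_FORMAT = 'jsonlines'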

Then, separately, after the crawl is done, you can read that file in chunks of the desired size and write each chunk out as a separate JSON file.
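A minimal sketch of that splitting step, assuming the crawl was exported to items.jl and 500 items per output file is the target (the filenames and chunk size are just placeholders):

import json

CHUNK_SIZE = 500          # items per output file, adjust as needed
INPUT_FILE = 'items.jl'   # JSON Lines file produced by the crawl

def write_chunk(items, index):
    # Each output file is a plain JSON array: items-0.json, items-1.json, ...
    with open('items-%d.json' % index, 'w') as out:
        json.dump(items, out)

def split_jsonlines(input_file=INPUT_FILE, chunk_size=CHUNK_SIZE):
    chunk = []
    file_index = 0
    with open(input_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                write_chunk(chunk, file_index)
                chunk = []
                file_index += 1
    if chunk:  # write whatever is left over
        write_chunk(chunk, file_index)

if __name__ == '__main__':
    split_jsonlines()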


so I can upload them to Amazon Cloud Search for indexing.

Note that Scrapy has a direct Amazon S3 feed storage backend (not sure if that helps, just FYI).
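If it does help, a sketch of what the settings might look like (the bucket name is a placeholder; S3 feed storage requires boto/botocore and valid AWS credentials):

# settings.py -- export the feed straight to S3 instead of the local disk
FEED_URI = 's3://your-bucket-name/items.jl'
FEED_FORMAT = 'jsonlines'
AWS_ACCESS_KEY_ID = '...'       # or rely on environment variables / IAM role
AWS_SECRET_ACCESS_KEY = '...'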

