python - Scrapy - Output to Multiple JSON files


I'm pretty new to Scrapy. I'm looking to use it to crawl an entire website for links and to output the items to multiple JSON files, so I can then upload them to Amazon Cloud Search for indexing. Is it possible to split the items into multiple files instead of having one giant file at the end? From what I've read, Item Exporters can only output one file per spider, and I'm only using one CrawlSpider for this task. It would be nice if I could set a limit on the number of items included in each file, like 500 or 1000.

Here is the code I have set up so far (based on the dmoz.org example used in the tutorial):

dmoz_spider.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    # Follow every link on the site and run parse_item on each response.
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

items.py

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Thanks for the help.

I don't think the built-in feed exporters support writing to multiple files.

One option is to export to a single file in JSON Lines format: basically, one JSON object per line, which makes the output convenient to pipe and split.
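For example, JSON Lines output can be requested straight from the command line; the .jl extension selects the jsonlines exporter (in older Scrapy versions you may need to pass the format explicitly with -t jsonlines):

scrapy crawl dmoz -o items.jl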

Then, separately, after the crawl is done, you can read that file back in chunks of the desired size and write out separate JSON files.
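A minimal sketch of that post-processing step, assuming the crawl was exported to items.jl and using 500 items per file (both the filename and the chunk size are placeholders):

import json

CHUNK_SIZE = 500  # maximum number of items per output file

# Each line of the .jl file is one JSON object.
with open('items.jl') as infile:
    items = [json.loads(line) for line in infile if line.strip()]

# Write each chunk of items to its own numbered JSON file.
for index, start in enumerate(range(0, len(items), CHUNK_SIZE)):
    with open('items_%d.json' % index, 'w') as outfile:
        json.dump(items[start:start + CHUNK_SIZE], outfile)

For a very large crawl you would stream the lines instead of reading them all into memory first, but the chunking idea is the same.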


so I can then upload them to Amazon Cloud Search for indexing.

Note that Scrapy's feed exports also have direct Amazon S3 storage support (not sure if that helps, but FYI).
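If you do end up going through S3, a sketch of the relevant settings.py entries, with placeholder bucket name and credentials (S3 storage also requires botocore or boto to be installed):

AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY'      # placeholder
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_KEY'  # placeholder
FEED_URI = 's3://your-bucket/crawls/items.jl'
FEED_FORMAT = 'jsonlines'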

