python - Scrapy - Output to Multiple JSON files


I am pretty new to Scrapy. I am looking into using it to crawl an entire website for links, and to output the items into multiple JSON files so I can upload them to Amazon Cloud Search for indexing. Is it possible to split the items into multiple files instead of having one giant file at the end? From what I've read, the item exporters can only output one file per spider, and I am only using one CrawlSpider for this task. It would be nice if I could set a limit on the number of items included in each file, like 500 or 1000.

Here is the code I have set up so far (based on the dmoz.org site used in the tutorial):

dmoz_spider.py

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import DmozItem

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]

    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

items.py

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Thanks for the help.

I don't think the built-in feed exporters support writing to multiple files.

One option is to export a single file in JSON Lines format, which is basically one JSON object per line, and is therefore convenient to pipe and split.
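For example, you can run the crawl as scrapy crawl dmoz -o items.jl (the .jl extension selects the JSON Lines exporter), or set the feed options in settings.py. A minimal sketch, assuming the Scrapy version in use still supports the FEED_URI/FEED_FORMAT settings:

# settings.py -- write all scraped items to one JSON Lines file
FEED_URI = 'items.jl'
FEED_FORMAT = 'jsonlines'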

Then, separately, after the crawl is done, you can read that file in chunks of the desired size and write each chunk out as a separate JSON file.
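A minimal sketch of that splitting step, assuming the crawl was exported to items.jl and 500 items per output file is the target (the filenames and chunk size are just placeholders):

import json

CHUNK_SIZE = 500          # items per output file, adjust as needed
INPUT_FILE = 'items.jl'   # JSON Lines file produced by the crawl

def write_chunk(items, index):
    # Each output file is a plain JSON array: items-0.json, items-1.json, ...
    with open('items-%d.json' % index, 'w') as out:
        json.dump(items, out)

def split_jsonlines(input_file=INPUT_FILE, chunk_size=CHUNK_SIZE):
    chunk = []
    file_index = 0
    with open(input_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                write_chunk(chunk, file_index)
                chunk = []
                file_index += 1
    if chunk:  # write whatever is left over
        write_chunk(chunk, file_index)

if __name__ == '__main__':
    split_jsonlines()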


so I can upload them to Amazon Cloud Search for indexing.

Note that Scrapy has a direct Amazon S3 feed storage backend (not sure if that helps, just FYI).
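If it does help, a sketch of what the settings might look like (the bucket name is a placeholder; S3 feed storage requires boto/botocore and valid AWS credentials):

# settings.py -- export the feed straight to S3 instead of the local disk
FEED_URI = 's3://your-bucket-name/items.jl'
FEED_FORMAT = 'jsonlines'
AWS_ACCESS_KEY_ID = '...'       # or rely on environment variables / IAM role
AWS_SECRET_ACCESS_KEY = '...'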

