scala - Why does RDD.groupBy return an empty RDD if the initial RDD wasn't empty?


I have an RDD that I've used to load binary files. Each file is broken into multiple parts and processed. After the processing step, each entry is:

(filename, List[Results])

Since the files are broken into several parts, the filename is the same for several entries in the RDD. I'm trying to put the results for each part back together using reduceByKey. However, when I attempt to run a count on the resulting RDD, it returns 0:

val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB)
reducedResults.count() // 0

I've tried changing the key it uses, with no success. Even extremely simple attempts to group the results don't give any output.

// groupBy takes a function of one argument (the whole tuple), so pattern-match on it
val singleGroup = my_rdd.groupBy { case (k, v) => 1 }
singleGroup.count() // 0

On the other hand, if I collect the results, I can group them outside of Spark and it works fine. However, I still have additional processing that I need to do on the collected results, so that isn't a good option.

What could cause the groupBy/reduceByKey commands to return empty RDDs if the initial RDD isn't empty?

It turns out there was a bug in how I was generating the Spark configuration for that particular job. Instead of setting the spark.default.parallelism field to something reasonable, it was being set to 0.
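A misconfiguration like this can be caught before the job starts. The following is a minimal sketch, not the configuration code from this job: the object name ParallelismGuard, the helper safeParallelism, and the choice of the local core count as a fallback are all assumptions made for illustration.

```scala
object ParallelismGuard {
  // Hypothetical helper: keep the requested parallelism only if it is positive,
  // otherwise fall back to a known-good value.
  def safeParallelism(requested: Int, fallback: Int): Int = {
    require(fallback > 0, "fallback parallelism must be positive")
    if (requested > 0) requested else fallback
  }

  def main(args: Array[String]): Unit = {
    // Assumed fallback: the number of cores on the local machine.
    val fallback = Runtime.getRuntime.availableProcessors()

    // Simulates the buggy configuration code that produced 0.
    val requested = 0
    val parallelism = safeParallelism(requested, fallback)

    // The checked value could then be passed to Spark, for example:
    //   new SparkConf().set("spark.default.parallelism", parallelism.toString)
    println(s"spark.default.parallelism = $parallelism")
  }
}
```

Failing fast with require instead of falling back silently would also be reasonable, depending on whether a bad value should stop the job.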

From the Spark documentation on spark.default.parallelism:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

So while an operation like collect() worked fine, any attempt to reshuffle the data without specifying the number of partitions gave me an empty RDD. That'll teach me to trust old configuration code.
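Shuffle operations such as reduceByKey also accept an explicit partition count, which takes precedence over spark.default.parallelism. Here is a minimal sketch, assuming Spark 1.3 or later on the classpath. The local[2] master, the application name, the sample data, the partition count of 4, and the helper mergeByKey are placeholders rather than details from the original job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ExplicitPartitions {
  // Hypothetical helper: concatenate the result lists for each filename,
  // using an explicit partition count instead of the configured default.
  def mergeByKey(rdd: RDD[(String, List[Int])], numPartitions: Int): RDD[(String, List[Int])] =
    rdd.reduceByKey(_ ++ _, numPartitions)

  def main(args: Array[String]): Unit = {
    // Placeholder master and app name for a local run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("explicit-partitions")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the (filename, List[Results]) entries described in the question.
      val myRdd = sc.parallelize(Seq(
        ("a.bin", List(1, 2)),
        ("a.bin", List(3)),
        ("b.bin", List(4))
      ), numSlices = 2)

      val reduced = mergeByKey(myRdd, 4)

      // Checking the partition count is a quick way to spot a zero-partition shuffle.
      println(s"partitions: ${reduced.partitions.length}")
      println(s"count: ${reduced.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Printing partitions.length on the shuffled RDD is a cheap diagnostic: a value of 0 there points at the configuration, not at the data.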

