scala - Why does RDD.groupBy return an empty RDD if the initial RDD wasn't empty?

January 15, 2014

I have an RDD that I've used to load binary files. Each file is broken into multiple parts and processed. After the processing step, each entry is:

(filename, List[results])

Since the files are broken into several parts, the filename is the same for several entries in the RDD. I'm trying to put the results for each part back together using reduceByKey. However, when I attempt to run a count on the resulting RDD, it returns 0:

val reducedresults = my_rdd.reduceByKey((resultsa, resultsb) => resultsa ++ resultsb)
reducedresults.count() // 0

I've tried changing the key that it uses, with no success. Even extremely simple attempts to group the results don't produce any output.

val singlegroup = my_rdd.groupBy { case (k, v) => 1 }
singlegroup.count() // 0

On the other hand, if I simply collect the results, I can group them outside of Spark and everything works fine. However, I still have additional processing that I need to do on the collected results, so that isn't a good option.

What would cause the groupBy/reduceByKey commands to return empty RDDs if the initial RDD isn't empty?

It turns out there was a bug in how I was generating the Spark configuration for that particular job. Instead of setting the spark.default.parallelism field to something reasonable, it was being set to 0.

From the Spark documentation on spark.default.parallelism:

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

So while an operation like collect() worked perfectly fine, any attempt to reshuffle the data without specifying the number of partitions gave me an empty RDD. That'll teach me to trust old configuration code.
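As a rough illustration of guarding against this, the sketch below checks spark.default.parallelism before the context is created and also passes an explicit partition count to reduceByKey so the shuffle does not rely on the configured default. The object name ParallelismCheck, the helper requirePositiveParallelism, the local[2] master, the sample file names, and the partition count of 4 are all assumptions made for this example and are not from the original job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismCheck {

  // Hypothetical helper: fail fast if spark.default.parallelism is set
  // to a non-positive (or non-numeric) value. An unset key is left alone
  // so that Spark can pick its own default.
  def requirePositiveParallelism(conf: SparkConf): Unit = {
    conf.getOption("spark.default.parallelism").foreach { raw =>
      require(raw.trim.toInt > 0,
        s"spark.default.parallelism must be positive, got '$raw'")
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed settings for a local test run.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("parallelism-check")
      .set("spark.default.parallelism", "4")

    requirePositiveParallelism(conf)

    val sc = new SparkContext(conf)
    try {
      // Stand-in data shaped like (filename, List[results]).
      val my_rdd = sc.parallelize(Seq(
        ("a.bin", List(1, 2)),
        ("a.bin", List(3)),
        ("b.bin", List(4))
      ))

      // The second argument sets the number of output partitions
      // explicitly instead of falling back to the configured default.
      val reducedresults =
        my_rdd.reduceByKey((resultsa, resultsb) => resultsa ++ resultsb, 4)

      // Diagnostics that would have exposed the misconfiguration.
      println(s"default parallelism: ${sc.defaultParallelism}")
      println(s"partitions after shuffle: ${reducedresults.partitions.length}")
      println(s"entries after shuffle: ${reducedresults.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Printing sc.defaultParallelism and partitions.length next to count() is a cheap way to tell a configuration problem apart from a problem in the data.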
