python - Reduce job in Spark by reduceByKey() or other functions?
Given the following list:

[(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2]), ...]

I need to output, for each key, the list of first elements of the list-values (i.e., 135, 2409, 12846 for key 0 and 13840 for key 1), but only when the second element of the list-value (i.e., 2, 1, 2 for key 0 and 2 for key 1) is greater than or equal to a threshold value (let's say 2). For instance, in this particular case the output should be:

[(0, [135, 12846]), (1, [13840]), ...]

The tuple (0, [2409, 1]) is discarded because 1 < 2.

I've achieved this by applying groupByKey(), mapValues(list), and a final map function, but groupByKey() is less efficient than the reduce functions.
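In PySpark, that groupByKey-based version looks roughly like the following (a minimal, untested sketch; sc is assumed to be an existing SparkContext and 2 is the example threshold):

# Sketch of the groupByKey() + mapValues(list) + map approach described above.
# Assumes sc is an existing SparkContext; 2 is the threshold used as an example.
rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

grouped = (rdd.groupByKey()
              .mapValues(list)
              .map(lambda kv: (kv[0], [v[0] for v in kv[1] if v[1] >= 2])))

print(grouped.collect())  # e.g. [(0, [135, 12846]), (1, [13840])]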
Is it possible to achieve this task using the reduceByKey() or combineByKey() functions?
The answer is yes :) You can achieve the same with reduceByKey as with groupByKey. In fact, reduceByKey should be favoured because it performs a map-side reduce before shuffling the data.

A solution using reduceByKey (in Scala, but I'm sure you get the point and can convert it to Python if you prefer):
val rdd = sc.parallelize(List((0, List(135, 2)), (0, List(2409, 1)), (0, List(12846, 2)), (1, List(13840, 2))))

rdd.mapValues(v => if (v(1) >= 2) List(v(0)) else List.empty)
   .reduceByKey(_ ++ _)
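For reference, a possible Python/PySpark conversion of the snippet above (a minimal sketch under the same assumptions; sc is an existing SparkContext and 2 is the threshold):

# Emit a singleton list for qualifying values, an empty list otherwise,
# then concatenate the lists per key; reduceByKey combines on the map side
# before the shuffle.
rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

result = (rdd.mapValues(lambda v: [v[0]] if v[1] >= 2 else [])
             .reduceByKey(lambda a, b: a + b))

print(result.collect())  # e.g. [(0, [135, 12846]), (1, [13840])]

Note that, as with the Scala version, a key whose values are all below the threshold would come back with an empty list; an extra filter() on the result can drop those if needed.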