python - Reduce job in Spark by reduceByKey() or other functions?
Given the following list:

[(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2]), ...]

I need to output, for each key, the list of first elements of the list-values (i.e., 135, 2409, 12846 for key 0, and 13840 for key 1), keeping only those whose second element of the list-value (i.e., 2, 1, 2 for key 0, and 2 for key 1) is greater than or equal to a threshold value (let's say 2). For instance, in this particular case the output should be:

[(0, [135, 12846]), (1, [13840]), ...]

The tuple (0, [2409, 1]) is discarded because 1 < 2.
I've achieved this by applying groupByKey(), mapValues(list), and a final map function, but groupByKey() is less efficient than the reduce functions. Is it possible to achieve this task using the reduceByKey() or combineByKey() function?
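For reference, here is a minimal PySpark sketch of the groupByKey()-based approach described above (reconstructed here, not the asker's original code; the threshold 2 and the local SparkContext setup are assumptions):

from pyspark import SparkContext

sc = SparkContext(appName="threshold-filter")

rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

# Group all values per key, materialise them as a list, then keep only the
# first elements whose second element meets the threshold (>= 2).
result = (rdd.groupByKey()
             .mapValues(list)
             .map(lambda kv: (kv[0], [v[0] for v in kv[1] if v[1] >= 2])))

print(result.collect())  # [(0, [135, 12846]), (1, [13840])]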
The answer is yes :) You can achieve the same result with reduceByKey as with groupByKey. In fact, reduceByKey should be favoured because it performs a map-side reduce before shuffling the data.
A solution using reduceByKey (in Scala, but I'm sure you get the point, and you can convert it to Python if you prefer):

val rdd = sc.parallelize(List((0, List(135, 2)), (0, List(2409, 1)), (0, List(12846, 2)), (1, List(13840, 2))))

rdd.mapValues(v => if (v(1) >= 2) List(v(0)) else List.empty)
   .reduceByKey(_ ++ _)
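Since the question is tagged python, here is a rough PySpark equivalent of the snippet above (a translation sketch, not the answerer's code; the threshold 2 is hard-coded as in the example):

rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

# Map each value to a one-element list if it passes the threshold, otherwise
# to an empty list, then concatenate the lists per key.
result = (rdd.mapValues(lambda v: [v[0]] if v[1] >= 2 else [])
             .reduceByKey(lambda a, b: a + b))

print(result.collect())  # [(0, [135, 12846]), (1, [13840])]

Note that, with both versions, a key whose values all fail the threshold ends up paired with an empty list rather than being dropped; if that matters, a final filter(lambda kv: kv[1]) can remove those entries.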