python - Reduce job in Spark by reduceByKey() or other functions?


Given the following list:

[(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2]), ...] 

I need to output, for each key, the list of first elements of the list-values (i.e., 135, 2409, 12846 for key 0 and 13840 for key 1), but only for entries whose second element (i.e., 2, 1, 2 for key 0 and 2 for key 1) is greater than or equal to some value (let's say 2). For instance, in this particular case the output should be:

[(0, [135, 12846]), (1, [13840]), ...] 

The tuple (0, [2409, 1]) is discarded because 1 < 2.

I've achieved this by applying groupByKey(), mapValues(list), and a final map function, but groupByKey() is less efficient than the reduce functions. A sketch of that approach is shown below.
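
For reference, roughly what that approach looks like in PySpark (an illustrative sketch assuming an existing SparkContext sc, not necessarily the exact code):

rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

result = (rdd
          .groupByKey()
          .mapValues(list)
          .map(lambda kv: (kv[0], [v[0] for v in kv[1] if v[1] >= 2]))  # keep first elements whose counter passes the threshold
          .collect())

# result: [(0, [135, 12846]), (1, [13840])]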

Is it possible to achieve this task using the reduceByKey() or combineByKey() functions?

The answer is yes :) You can achieve the same result with reduceByKey as with groupByKey. In fact, reduceByKey should be favoured because it performs a map-side reduce before shuffling the data.

A solution using reduceByKey (in Scala; I'm sure you get the point and can convert it to Python if you prefer):

val rdd = sc.parallelize(List((0, List(135, 2)), (0, List(2409, 1)), (0, List(12846, 2)), (1, List(13840, 2))))

rdd.mapValues(v => if (v(1) >= 2) List(v(0)) else List.empty)
   .reduceByKey(_ ++ _)
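
If a Python version helps, a rough PySpark equivalent of the same idea would be (a sketch, assuming an existing SparkContext sc):

rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

result = (rdd
          .mapValues(lambda v: [v[0]] if v[1] >= 2 else [])  # drop values below the threshold before the shuffle
          .reduceByKey(lambda a, b: a + b)                    # concatenate the surviving singletons per key
          .collect())

# result: [(0, [135, 12846]), (1, [13840])]

Since the question also mentions combineByKey, the same filter-and-concatenate logic could be expressed with it as well, although for plain list concatenation reduceByKey is usually sufficient:

result = (rdd
          .combineByKey(
              lambda v: [v[0]] if v[1] >= 2 else [],               # createCombiner
              lambda acc, v: acc + ([v[0]] if v[1] >= 2 else []),  # mergeValue
              lambda a, b: a + b)                                  # mergeCombiners
          .collect())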
