python - Reduce job in Spark by reduceByKey() or other functions?


Given the following list:

[(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2]), ...] 

I need to output, for each key, the list of first elements of the list-values (i.e., 135, 2409, 12846 for key 0 and 13840 for key 1), but only for entries whose second element (i.e., 2, 1, 2 for key 0 and 2 for key 1) is greater than or equal to some value (let's say 2). For instance, in this particular case the output should be:

[(0, [135, 12846]), (1, [13840]), ...] 

The tuple (0, [2409, 1]) is discarded because 1 < 2.

I've achieved this by applying groupByKey(), mapValues(list), and a final map function, but groupByKey() is less efficient than the reduce functions. A sketch of that approach is shown below.
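
For reference, roughly what that approach looks like in PySpark (an illustrative sketch assuming an existing SparkContext sc, not necessarily the exact code):

rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

result = (rdd
          .groupByKey()
          .mapValues(list)
          .map(lambda kv: (kv[0], [v[0] for v in kv[1] if v[1] >= 2]))  # keep first elements whose counter passes the threshold
          .collect())

# result: [(0, [135, 12846]), (1, [13840])]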

Is it possible to achieve this task using the reduceByKey() or combineByKey() functions?

The answer is yes :) You can achieve the same result with reduceByKey as with groupByKey. In fact, reduceByKey should be favoured because it performs a map-side reduce before shuffling the data.

A solution using reduceByKey (in Scala; I'm sure you get the point and can convert it to Python if you prefer):

val rdd = sc.parallelize(List((0, List(135, 2)), (0, List(2409, 1)), (0, List(12846, 2)), (1, List(13840, 2))))

rdd.mapValues(v => if (v(1) >= 2) List(v(0)) else List.empty)
   .reduceByKey(_ ++ _)
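
If a Python version helps, a rough PySpark equivalent of the same idea would be (a sketch, assuming an existing SparkContext sc):

rdd = sc.parallelize([(0, [135, 2]), (0, [2409, 1]), (0, [12846, 2]), (1, [13840, 2])])

result = (rdd
          .mapValues(lambda v: [v[0]] if v[1] >= 2 else [])  # drop values below the threshold before the shuffle
          .reduceByKey(lambda a, b: a + b)                    # concatenate the surviving singletons per key
          .collect())

# result: [(0, [135, 12846]), (1, [13840])]

Since the question also mentions combineByKey, the same filter-and-concatenate logic could be expressed with it as well, although for plain list concatenation reduceByKey is usually sufficient:

result = (rdd
          .combineByKey(
              lambda v: [v[0]] if v[1] >= 2 else [],               # createCombiner
              lambda acc, v: acc + ([v[0]] if v[1] >= 2 else []),  # mergeValue
              lambda a, b: a + b)                                  # mergeCombiners
          .collect())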
