sorting - Spark rank in Scala based on the second and third elements of a tuple in an RDD
Hi, I want to assign a rank to each row based on the second and third elements of a tuple. Here is some sample data. Add "1" as a fourth element if the tuple's third element has the maximum value for its id. If several tuples share the same third element, break the tie on the second element, i.e. the tuple with the maximum second element should get "1" as its fourth element. The fourth element of all other tuples should be 0. I hope the requirement is clear:
Tuple layout: (id, second, third)
(32609,878,199)
(32609,832,199)
(45470,231,199)
(42482,1001,299)
(42482,16,291)
code: val rank = matching.map { case (x1, x2, x3) => (x1, x2, x3, x3.toInt * 100000 + x2.toInt) }.sortBy(-_._4).groupBy(_._1)
result: rank.take(10).foreach(println)
(32609,CompactBuffer((32609,878,199,19900878), (32609,832,199,19900832)))
(45470,CompactBuffer((45470,231,199,19900231)))
(42482,CompactBuffer((42482,1001,299,29901001), (42482,16,291,29100016)))
desired output :
(32609,878,199,1) (32609,832,199,0) (45470,231,199,1) (42482,1001,299,1) (42482,16,291,0)
It seems you can try the following:
import scala.util.Try

val rank = matching.flatMap {
  case (x: String, y: String, z: String) =>
    val yInt = Try(y.toInt)
    val zInt = Try(z.toInt)
    // Keep only rows whose second and third elements parse as integers
    if (yInt.isSuccess && zInt.isSuccess) Option((x, (yInt.get, zInt.get)))
    else None
}.groupByKey().flatMap {
  case (key: String, tuples: Iterable[(Int, Int)]) =>
    // Sort by the third element, then by the second, both descending
    val sorted = tuples.toList.sortBy(x => (-x._2, -x._1))
    val topRank = (key, sorted.head._1, sorted.head._2, 1)
    val restRank = for (tup <- sorted.tail) yield (key, tup._1, tup._2, 0)
    List(topRank) ++ restRank
}
The initial flatMap performs type checking and reorders the tuples into (key, value) pairs. The second flatMap (after groupByKey) sorts each list by the 3rd and then the 2nd element, both descending, and recreates the tuples with the rank. Note that you need to import scala.util.Try to use this.
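The ranking rule itself can be checked without a cluster. The following is only a sketch on plain Scala collections, not the Spark code from the answer. It assumes the ids are strings and that the second and third elements have already been parsed to Int. The names RankSketch and rank are made up for illustration.

```scala
// Sketch only: a plain Seq stands in for the RDD (assumption for illustration).
object RankSketch {
  type Row = (String, Int, Int) // (id, second, third)

  // Returns every row with a fourth element:
  // 1 for the best row of its id, 0 for all other rows.
  def rank(rows: Seq[Row]): Seq[(String, Int, Int, Int)] = {
    // For each id, find the largest (third, second) pair.
    // Tuple ordering compares third first and uses second as the tie-breaker.
    val best: Map[String, (Int, Int)] =
      rows.groupBy(_._1).map { case (id, group) =>
        id -> group.map(r => (r._3, r._2)).max
      }

    // Walk the rows in their original order and flag the winners.
    rows.map { case (id, second, third) =>
      val flag = if (best(id) == ((third, second))) 1 else 0
      (id, second, third, flag)
    }
  }

  def main(args: Array[String]): Unit = {
    // Sample data from the question
    val sample = Seq(
      ("32609", 878, 199),
      ("32609", 832, 199),
      ("45470", 231, 199),
      ("42482", 1001, 299),
      ("42482", 16, 291)
    )
    rank(sample).foreach(println)
  }
}
```

Unlike the sorted version above, this sketch keeps the rows in their input order, and exact duplicate rows that tie for the top would all receive 1. If only one of them should be flagged, the sort-and-take-head approach from the answer is the safer choice.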
Edit: modified the ranking order per the comment below.