hadoop - get MAX from SUM in PIG -
player = load 'ass2_player' using org.apache.hive.hcatalog.pig.hcatloader(); player = foreach player generate (chararray)$3 tmid, (int)$1 year, (int)$8 points; group_data = group player (year, tmid); sum_data = foreach group_data generate group, sum(player.points) tot_points; max_data = foreach sum_data generate flatten(group), max(sum_data.tot_points); dump max_data;
i want select tmid of team has highest point each year.
how whole row or partial fields or row max value. like, after group year, group contains "year" , tuple take "tmid, tot_points". how got like: (year, tmid, tot_points) each year.
you there. here schema sum_data
:
((year, tmid), tot_points)
from here, need group
on year , take max
on tot_points
. easier if flatten
group in sum_data
step only, such as:
sum_data = foreach group_data generate flatten(group) (year, tmid), sum(player.points) tot_points; sum_data_grouped = group sum_data year; max_data = foreach sum_data_grouped generate group year, max(sum_data.tot_points) max_points, sum_data.tmpid tmid;
your final script should this:
player = load 'ass2_player' using org.apache.hive.hcatalog.pig.hcatloader(); player = foreach player generate (chararray)$3 tmid, (int)$1 year, (int)$8 points; group_data = group player (year, tmid); sum_data = foreach group_data generate flatten(group) (year, tmid), sum(player.points) tot_points; sum_data_grouped = group sum_data year; max_data = foreach sum_data_grouped generate group year, max(sum_data.tot_points) max_points, sum_data.tmpid tmid;
ps: wrote on mobile , didn't test this. let me know get.
Comments
Post a Comment