Apache Spark - PySpark RDD processing for sum of parts
I have an RDD of tuples (datetime, integer), and I'm trying to compute interval sums over this RDD in PySpark.

For example, given the following:

    (2015-09-30 10:00:01, 3)
    (2015-09-30 10:00:02, 1)
    (2015-09-30 10:00:05, 2)
    (2015-09-30 10:00:06, 7)
    (2015-09-30 10:00:07, 3)
    (2015-09-30 10:00:10, 5)

I'm trying to produce the sum over every 3-second window:

    (2015-09-30 10:00:01, 4)   # sum of seconds 1, 2, 3
    (2015-09-30 10:00:02, 1)   # sum of seconds 2, 3, 4
    (2015-09-30 10:00:05, 12)  # sum of seconds 5, 6, 7
    (2015-09-30 10:00:06, 10)  # sum of seconds 6, 7, 8
    (2015-09-30 10:00:07, 3)   # sum of seconds 7, 8, 9
    (2015-09-30 10:00:10, 5)   # sum of seconds 10, 11, 12

Could you give me some hints?
I assume your input RDD is called time_rdd and holds tuples whose first element is a datetime object and whose second element is an integer. You can use flatMap to map every datetime object to the keys of the 3-second windows it belongs to (its own timestamp and the two preceding seconds), and then use reduceByKey to total the counts per window.
    from datetime import timedelta

    def map_to_3_seconds(datetime_obj, count):
        # Emit the count under this timestamp and the two seconds before it,
        # so that each key collects all counts falling in its 3-second window.
        list_times = []
        for i in range(-2, 1):
            list_times.append((datetime_obj + timedelta(seconds=i), count))
        return list_times

    output_rdd = time_rdd.flatMap(lambda pair: map_to_3_seconds(pair[0], pair[1])) \
                         .reduceByKey(lambda x, y: x + y)

(Note: tuple unpacking in lambda arguments, as in lambda (datetime_obj, count): ..., only works in Python 2, so the lambda above indexes into the pair instead.)
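To see what the flatMap step emits, calling the helper on a single tuple (the 10:00:05 row from the example) gives, formatted for readability:

    >>> from datetime import datetime
    >>> map_to_3_seconds(datetime(2015, 9, 30, 10, 0, 5), 2)
    [(datetime.datetime(2015, 9, 30, 10, 0, 3), 2),
     (datetime.datetime(2015, 9, 30, 10, 0, 4), 2),
     (datetime.datetime(2015, 9, 30, 10, 0, 5), 2)]

So the count 2 contributes to the windows keyed at 10:00:03, 10:00:04, and 10:00:05, and after reduceByKey the value at key t is the total of all counts in [t, t+2].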
This RDD will contain more datetime keys than the original RDD, so if you want to keep only the original times, you need to join it back with time_rdd:

    result = output_rdd.join(time_rdd).map(lambda pair: (pair[0], pair[1][0])).collect()

The join keeps only keys present in both RDDs, and the map keeps the windowed sum (the first element of each joined value pair).
Now result contains:

    [(datetime.datetime(2015, 9, 30, 10, 0, 5), 12),
     (datetime.datetime(2015, 9, 30, 10, 0, 2), 1),
     (datetime.datetime(2015, 9, 30, 10, 0, 10), 5),
     (datetime.datetime(2015, 9, 30, 10, 0, 1), 4),
     (datetime.datetime(2015, 9, 30, 10, 0, 6), 10),
     (datetime.datetime(2015, 9, 30, 10, 0, 7), 3)]
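For reference, here is a minimal end-to-end sketch that wires the pieces together. It assumes a local SparkContext for testing and the map_to_3_seconds helper defined above:

    from datetime import datetime
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "interval-sum")  # assumed local context for testing

    data = [
        (datetime(2015, 9, 30, 10, 0, 1), 3),
        (datetime(2015, 9, 30, 10, 0, 2), 1),
        (datetime(2015, 9, 30, 10, 0, 5), 2),
        (datetime(2015, 9, 30, 10, 0, 6), 7),
        (datetime(2015, 9, 30, 10, 0, 7), 3),
        (datetime(2015, 9, 30, 10, 0, 10), 5),
    ]
    time_rdd = sc.parallelize(data)

    # Expand each record into its 3-second window keys, sum per key,
    # then restrict back to the timestamps present in the input.
    output_rdd = (time_rdd
                  .flatMap(lambda pair: map_to_3_seconds(pair[0], pair[1]))
                  .reduceByKey(lambda x, y: x + y))

    result = (output_rdd
              .join(time_rdd)
              .map(lambda pair: (pair[0], pair[1][0]))
              .collect())

    for timestamp, total in sorted(result):
        print(timestamp, total)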