apache spark - PySpark RDD processing for sum of parts


I have an RDD of tuples (datetime, integer), and I'm trying to compute interval sums over the RDD in PySpark.

For example, given the following:

(2015-09-30 10:00:01, 3)
(2015-09-30 10:00:02, 1)
(2015-09-30 10:00:05, 2)
(2015-09-30 10:00:06, 7)
(2015-09-30 10:00:07, 3)
(2015-09-30 10:00:10, 5)

I'm trying to produce the following, where each value is the sum over the 3-second window starting at that timestamp:

(2015-09-30 10:00:01, 4)  # sum of seconds 1, 2, 3
(2015-09-30 10:00:02, 1)  # sum of seconds 2, 3, 4
(2015-09-30 10:00:05, 12) # sum of seconds 5, 6, 7
(2015-09-30 10:00:06, 10) # sum of seconds 6, 7, 8
(2015-09-30 10:00:07, 3)  # sum of seconds 7, 8, 9
(2015-09-30 10:00:10, 5)  # sum of seconds 10, 11, 12

Could you please give me some hints?

I assume your input RDD time_rdd contains tuples whose first element is a datetime object and whose second element is an integer. You can use flatMap to map every datetime object to itself and the two preceding seconds, and then use reduceByKey to total the counts for each window.

from datetime import timedelta

def map_to_3_seconds(datetime_obj, count):
    # Emit the count under this timestamp and the two preceding seconds,
    # so each record contributes to every 3-second window that covers it.
    list_times = []
    for i in range(-2, 1):
        list_times.append((datetime_obj + timedelta(seconds=i), count))
    return list_times

output_rdd = (time_rdd
              .flatMap(lambda pair: map_to_3_seconds(pair[0], pair[1]))
              .reduceByKey(lambda x, y: x + y))

This RDD will contain more datetime keys than the original RDD (windows starting at times that never appear in the input), so if you only want the original timestamps, you need to join back with time_rdd:

result = output_rdd.join(time_rdd).map(lambda kv: (kv[0], kv[1][0])).collect()

Now result contains:

[(datetime.datetime(2015, 9, 30, 10, 0, 5), 12),
 (datetime.datetime(2015, 9, 30, 10, 0, 2), 1),
 (datetime.datetime(2015, 9, 30, 10, 0, 10), 5),
 (datetime.datetime(2015, 9, 30, 10, 0, 1), 4),
 (datetime.datetime(2015, 9, 30, 10, 0, 6), 10),
 (datetime.datetime(2015, 9, 30, 10, 0, 7), 3)]
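For completeness, here is a minimal sketch of how the sample input could be built as time_rdd to try this out; it assumes an existing SparkContext named sc (e.g. from a local pyspark shell), which is not part of the original question:

from datetime import datetime

# Hypothetical test data matching the sample in the question.
data = [
    (datetime(2015, 9, 30, 10, 0, 1), 3),
    (datetime(2015, 9, 30, 10, 0, 2), 1),
    (datetime(2015, 9, 30, 10, 0, 5), 2),
    (datetime(2015, 9, 30, 10, 0, 6), 7),
    (datetime(2015, 9, 30, 10, 0, 7), 3),
    (datetime(2015, 9, 30, 10, 0, 10), 5),
]
time_rdd = sc.parallelize(data)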
