Python - Naming Variables Using PySpark
Even though this problem is pretty simple, I'm having issues resolving it since I'm new to Spark.

In plain Python I would process the file as follows:

for line in open('schedule.txt'):
    origin, dest, depart, arrive, price = line.split(',')
In Spark I read the file as:

sched = sc.textFile('/path/schedule.txt')
But when I try the following code:

origin, dest, depart, arrive, price = sched.split(',')
I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-46-ba0e8c07ca89> in <module>()
----> 1 origin,dest,depart,arrive,price=sched.split(',')

AttributeError: 'RDD' object has no attribute 'split'
I can split the file using a lambda function, but I don't know how to create the five variable names. If anyone can help me, please do.
sched = sc.textFile('/path/schedule.txt')

returns an RDD, which is a different datatype from a Python file object and supports a different API. The equivalent of your Python code looks like this:

sched = sc.textFile('/path/schedule.txt')

# Extract the values
vals = sched.map(lambda line: line.split(','))

# Now you can do processing, for example sum the price column
# (convert to float first, otherwise '+' would concatenate strings)
price = vals.map(lambda v: float(v[4])).reduce(lambda a, b: a + b)

# Or collect the raw values locally
raw_vals = vals.collect()
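The split-then-aggregate logic above can be illustrated without a Spark cluster by running the same transformations over a local list (the sample rows below are made up for illustration):

```python
from functools import reduce

# Hypothetical rows standing in for schedule.txt
lines = [
    "BOS,JFK,08:00,09:10,120.50",
    "JFK,LAX,10:00,13:05,310.00",
]

# Local equivalent of sched.map(lambda line: line.split(','))
vals = [line.split(',') for line in lines]

# Local equivalent of mapping out the price column and reducing to a sum
total_price = reduce(lambda a, b: a + b, (float(v[4]) for v in vals))
```

The key point is that the split happens per element inside `map`; the RDD itself never has a `split` method.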
Update: If you want to access the values of each line as local variables, define a dedicated function instead of a lambda and pass it to .map():

def process_line(line):
    origin, dest, depart, arrive, price = line.split(',')
    # ... do whatever you need here ...
    # Remember to return a result
    return origin, dest, depart, arrive, price

sched.map(process_line)
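A fleshed-out version of such a function, testable without Spark, might look like this (the float conversion for the price field is an assumption about the data):

```python
def process_line(line):
    # Unpack the five comma-separated fields into local names
    origin, dest, depart, arrive, price = line.strip().split(',')
    # Remember to return a result, otherwise map() would yield None
    return origin, dest, depart, arrive, float(price)

# With Spark this would be applied as: sched.map(process_line)
result = process_line("BOS,JFK,08:00,09:10,120.50")
```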
Update 2: The specific processing you want to do on the file is not trivial, because it requires writing to a shared variable (flights). Instead, I'd suggest grouping the lines by (orig, dest), collecting the results, and inserting them into a dict:

flights_data = (sched
    .map(lambda line: line.split(','))
    .map(lambda v: ((v[0], v[1]), tuple(v[2:])))
    .groupByKey()
    .collect())
flights = {f: ds for f, ds in flights_data}
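To show the shape of the resulting dict without a Spark cluster, here is the same grouping done in plain Python over hypothetical rows (two flights share the BOS-JFK route to demonstrate the grouping):

```python
from collections import defaultdict

# Made-up schedule rows for illustration
lines = [
    "BOS,JFK,08:00,09:10,120.50",
    "BOS,JFK,12:00,13:10,99.00",
    "JFK,LAX,10:00,13:05,310.00",
]

# Local equivalent of map + groupByKey + collect, then the dict comprehension
grouped = defaultdict(list)
for line in lines:
    v = line.split(',')
    grouped[(v[0], v[1])].append(tuple(v[2:]))
flights = dict(grouped)
```

Each key is an (origin, dest) pair and each value is the list of (depart, arrive, price) tuples for that route.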