python - Naming Variables using PySpark
Even though the problem is pretty simple, since I'm new to Spark I'm having issues resolving it.

The normal Python code would be the following:
    for line in open('schedule.txt'):
        origin, dest, depart, arrive, price = line.split(',')

I read the file as:
    sched = sc.textFile('/path/schedule.txt')

but when trying the following code:
    origin,dest,depart,arrive,price = sched.split(',')

I'm getting this error:
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-46-ba0e8c07ca89> in <module>()
    ----> 1 origin,dest,depart,arrive,price=sched.split(',')

    AttributeError: 'RDD' object has no attribute 'split'

I can split the file using a lambda function, but I don't know how to create the 5 variable names.
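What I have so far is something like this (a sketch of the lambda split I mentioned):

    # split every line of the RDD into its fields
    parts = sched.map(lambda line: line.split(','))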
Any help would be appreciated.
sched = sc.textFile('/path/schedule.txt') returns an RDD, which is a different datatype from a Python file object and supports a different API. The equivalent of your Python code would look something like this:
    sched = sc.textFile('/path/schedule.txt')
    # extract the values from each line
    vals = sched.map(lambda line: line.split(','))
    # now you can do the processing, for example summing the prices
    # (price is the fifth column; convert it to a number first)
    total_price = vals.map(lambda v: float(v[4])).reduce(lambda a, b: a + b)
    # or collect the raw values
    raw_vals = vals.collect()
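To sanity-check the parsing, you could peek at a few parsed rows first (a quick sketch; the column layout origin,dest,depart,arrive,price is assumed from your question):

    # look at the first three parsed rows before doing any aggregation
    print(vals.take(3))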
Update: If you want to be able to access the values of each line as local variables, define a dedicated function instead of a lambda and pass it to .map():

    def process_line(line):
        origin, dest, depart, arrive, price = line.split(',')
        # do whatever you need with the values here
        # remember to return the result
        return origin, dest, depart, arrive, price

    sched.map(process_line)
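For example, a hypothetical filter on the returned tuples (the 200 threshold is made up, and price is assumed to be the fifth column):

    # keep only the flights cheaper than some threshold
    cheap = sched.map(process_line).filter(lambda t: float(t[4]) < 200)
    print(cheap.take(5))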
Update 2: The specific processing you want to do on the file is not trivial, because it requires writing to a shared variable (flights). Instead, I'd suggest grouping the lines by (origin, dest), collecting the results and inserting them into a dict:
    flights_data = (sched.map(lambda line: line.split(','))
                         .map(lambda v: ((v[0], v[1]), tuple(v[2:])))
                         .groupByKey().collect())
    flights = {f: list(ds) for f, ds in flights_data}
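Note that groupByKey yields an iterable per key, hence the list(ds) above. You can then look up a route directly (the airport codes here are hypothetical; use keys that actually occur in your schedule.txt):

    # all (depart, arrive, price) tuples for one origin/destination pair
    print(flights[('JFK', 'LAX')])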