python - Naming variables using PySpark


Even though the problem is pretty simple, I'm new to Spark and am having trouble resolving it.

In plain Python, the code I want looks like this:

for line in open('schedule.txt'):
    origin, dest, depart, arrive, price = line.split(',')

I read the file as:

sched = sc.textFile('/path/schedule.txt')

But when I try the following code:

  origin,dest,depart,arrive,price=sched.split(',') 

I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-46-ba0e8c07ca89> in <module>()
----> 1 origin,dest,depart,arrive,price=sched.split(',')

AttributeError: 'RDD' object has no attribute 'split'

I can split the file using a lambda function, but I don't know how to create the five variable names.

If anyone can help, I'd appreciate it.

sched = sc.textFile('/path/schedule.txt') returns an RDD, which is a different data type from a Python file object and supports a different API. The equivalent of your Python code would be something like:

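sched = sc.textFile('/path/schedule.txt')
# extract the values of each line as a list of fields
vals = sched.map(lambda line: line.split(','))
# now you can do some processing, for example sum the price
# (assumes the fifth field is numeric; convert before adding, since the fields are strings)
price = vals.map(lambda v: float(v[4])).reduce(lambda a, b: a + b)
# or collect the raw values to the driver
raw_vals = vals.collect()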
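One thing worth keeping in mind with reduce: the function receives two elements and must return a value of the same kind, because its result is fed back in as an input. A minimal sketch of the map-then-reduce pattern, using functools.reduce on a plain list as a local stand-in for the RDD (the sample rows are made up for illustration, not taken from the real schedule.txt):

```python
from functools import reduce

# Hypothetical rows, standing in for the lines of schedule.txt.
lines = [
    "LHR,JFK,09:00,12:00,450",
    "JFK,LAX,14:00,17:30,300",
    "LAX,SFO,18:00,19:15,99",
]

# Split each line into fields, as vals = sched.map(...) does on the RDD.
vals = [line.split(",") for line in lines]

# Map every row to a numeric price first, then reduce number + number.
prices = [float(v[4]) for v in vals]
total = reduce(lambda a, b: a + b, prices)
print(total)
```

On the RDD the same shape would be vals.map(lambda v: float(v[4])).reduce(lambda a, b: a + b).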

Update: if you want to be able to access the values of each line as local variables, define a dedicated function instead of a lambda and pass it to .map():

def process_line(line):
    origin, dest, depart, arrive, price = line.split(',')
    # do whatever you need with the five variables
    # remember to return the result

sched.map(process_line)
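Because process_line is an ordinary Python function, it can be tried out locally before handing it to Spark. A small sketch, assuming integer prices and using invented sample lines; the built-in map plays the role of sched.map here:

```python
def process_line(line):
    # Unpack the five comma-separated fields into local names.
    origin, dest, depart, arrive, price = line.strip().split(",")
    # Return whatever the next step needs; here a (key, value) pair.
    # int(price) is an assumption about the data, not something the file guarantees.
    return (origin, dest), (depart, arrive, int(price))

# Hypothetical sample lines, standing in for schedule.txt.
sample = ["LHR,JFK,09:00,12:00,450\n", "JFK,LAX,14:00,17:30,300\n"]

# The built-in map stands in for sched.map(process_line).
parsed = list(map(process_line, sample))
print(parsed)
```

Returning a (key, value) pair is just one option; it happens to be the shape that key-based operations such as groupByKey expect.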

Update 2:

The specific processing you want to do on the file is not trivial, because it requires writing to a shared variable (flights). Instead, I'd suggest grouping the lines by (origin, dest), collecting the results, and inserting them into a dict:

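# split each line first, so that f[0] and f[1] are fields rather than single characters
fields = sched.map(lambda line: line.split(','))
flights_data = fields.map(lambda f: ((f[0], f[1]), tuple(f[2:]))).groupByKey().collect()
# each ds is an iterable of (depart, arrive, price) tuples; list() makes it a plain list
flights = {f: list(ds) for f, ds in flights_data}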
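To see the shape of the resulting dict without a Spark cluster, the same grouping can be sketched in plain Python. This is only a local illustration with made-up rows; on a real RDD the order of values within a group is not guaranteed:

```python
from collections import defaultdict

# Hypothetical rows, already split into fields.
rows = [
    ["LHR", "JFK", "09:00", "12:00", "450"],
    ["LHR", "JFK", "15:00", "18:00", "520"],
    ["JFK", "LAX", "14:00", "17:30", "300"],
]

# Group (depart, arrive, price) tuples under their (origin, dest) key,
# mirroring what groupByKey() followed by collect() hands back.
flights = defaultdict(list)
for row in rows:
    flights[(row[0], row[1])].append(tuple(row[2:]))
flights = dict(flights)
print(flights[("LHR", "JFK")])
```

Every (origin, dest) pair maps to the list of flights on that route, so a lookup like flights[("LHR", "JFK")] returns all departures for it.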
