scala - How is the performance impact of select statements on Spark DataFrames?
When using many select statements or expressions on Spark DataFrames, I wonder what the performance impact is on subsequent transformations once an action is triggered.

Given a DataFrame df with 10 columns, a to j:
1. What is the influence of using as for column renaming on each column?
   df.select( df("a").as("1"), ..., df("j").as("10") )
2. What if I select only a subset (e.g. 5 columns)?
   val df2 = df.select( df("a"), ..., df("e") )
   b. How does Spark handle this projection? Is df still kept (as df2 is a projection), with df serving as a kind of reference? Or is df2 instead created freshly and df discarded? (Neglecting persist here.)
3. What is the influence of general column expressions used in select?

Are performance tests for the above cases available? And are performance measurements in general available somewhere? If not, how can performance best be measured?
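
For anyone who wants to try the cases above, here is a minimal sketch of how they could be reproduced and inspected. It assumes Spark 3.x running in local mode. The object name SelectPlanSketch, the helpers buildDf, renameAll, firstFive and timeMs, and the generated stand-in data are all made up for illustration. explain(true) prints the logical and physical plans, so one can see how each projection is planned, and the noop write format (available since Spark 3.0) forces the whole query to run without writing any output. Single wall-clock timings like this are only a rough indication, not a proper benchmark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SelectPlanSketch {
  // Column names a..j, matching the question.
  val names: Seq[String] = ('a' to 'j').map(_.toString)

  // Hypothetical stand-in for df: ten numeric columns derived from a generated id.
  def buildDf(spark: SparkSession, rows: Long): DataFrame =
    spark.range(rows)
      .select(names.zipWithIndex.map { case (n, i) => (col("id") + i).as(n) }: _*)

  // Case 1: rename every column with `as` (a -> "1", ..., j -> "10").
  def renameAll(df: DataFrame): DataFrame =
    df.select(names.zipWithIndex.map { case (n, i) => df(n).as((i + 1).toString) }: _*)

  // Case 2: keep only the first five columns (a..e).
  def firstFive(df: DataFrame): DataFrame =
    df.select(names.take(5).map(n => df(n)): _*)

  // Hypothetical helper: wall-clock time of one action, in milliseconds.
  def timeMs(body: => Unit): Double = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-plan-sketch")
      .getOrCreate()

    val df = buildDf(spark, 1000000L)
    val renamed = renameAll(df)
    val df2 = firstFive(df)

    // Print parsed, analyzed, optimized and physical plans for both cases.
    renamed.explain(true)
    df2.explain(true)

    // Force full execution without producing output (noop sink, Spark 3.0+).
    val t1 = timeMs(renamed.write.format("noop").mode("overwrite").save())
    val t2 = timeMs(df2.write.format("noop").mode("overwrite").save())
    println(f"rename all columns: $t1%.1f ms")
    println(f"select five columns: $t2%.1f ms")

    spark.stop()
  }
}
```

Comparing the optimized plans of renamed and df2 with the plan of df itself is probably more informative than the timings alone, since the first run also pays JVM warm-up costs.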