scala - What is the performance impact of select statements on Spark DataFrames?


I am using many select statements and column expressions on Spark DataFrames, and I wonder what their performance impact is on subsequent transformations once an action is triggered.

Given a DataFrame df with 10 columns, a through j:

  1. What is the influence on performance if I use as (column renaming) on each column?

    df.select( df("a").as("1"), ..., df("j").as("10"))

  2. What if I select only a subset (e.g. 5 columns)?

    val df2 = df.select( df("a"), ..., df("e") )

    b. How does Spark handle this projection? Is df still kept (with df2 as a projection of it), so that df serves as a kind of reference? Or is df2 created afresh and df discarded? (Neglecting persist here.)

  3. What is the influence of general column expressions used in select?

  4. Are performance tests for the above cases available, or are performance measurements in general available somewhere? If not, what is the best way to measure performance?
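
For the measurement part of question 4, one possible starting point is a small wall-clock timing helper. This is only a minimal sketch in plain Scala: the names SelectTiming, timed and medianMs are hypothetical and not part of Spark, and a simple median over a few runs is an assumption about what "measuring" should mean here, not an established benchmark method.

```scala
object SelectTiming {

  /** Runs `block` once and returns its result together with the
    * elapsed wall-clock time in milliseconds. */
  def timed[T](block: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  /** Runs `block` `warmups` times without timing it, then `runs` times
    * with timing, and returns the median elapsed time in milliseconds
    * (the upper middle sample when `runs` is even). */
  def medianMs[T](warmups: Int, runs: Int)(block: => T): Double = {
    require(runs > 0, "runs must be positive")
    (1 to warmups).foreach(_ => block)                       // untimed warm-up runs
    val samples = (1 to runs).map(_ => timed(block)._2).sorted
    samples(samples.length / 2)
  }
}
```

Because transformations such as select are lazy, the timed block would need to contain an action, for example SelectTiming.medianMs(2, 5)(df2.count()), run once for each variant being compared. Independently of timing, df2.explain(true) prints the logical and physical plans, which shows how the renaming or the column subset actually ends up in the plan that gets executed.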

