Apache Spark - accessing internal data on RDDs?


I started doing the AMP Camp 5 exercises. I tried the following two scenarios:

Scenario #1

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count

Scenario #2

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count

The total time shown in the Spark shell Application UI differs between the two scenarios:
scenario #1 took 0.5 seconds, while scenario #2 took only 0.2 s.

In scenario #1, the checkpoint command by itself does nothing; it is neither a transformation nor an action. It just says that once the RDD materializes after an action, it should be saved to disk. Am I missing something here?

Questions:

  1. I understand why scenario #1 takes more time, because the RDD is check-pointed (written to disk). Is there a way I can find out the time taken by checkpointing, out of the total time?
    The Spark shell Application UI shows the following - Scheduler Delay, Task Deserialization Time, GC Time, Result Serialization Time, Getting Result Time - but it doesn't show a breakdown for checkpointing.

  2. Is there a way to access the above metrics, e.g. Scheduler Delay and GC Time, and save them programmatically? I want to log some of the above metrics for every action invoked on an RDD.

  3. How can I programmatically access the following information:

    • the size of an RDD, when persisted to disk by checkpointing?
    • what percentage of an RDD is in memory currently?
    • the overall time taken to compute an RDD?

Please let me know if you need more information.

The Spark REST API provides what you are asking for.

Some examples:

What percentage of an RDD is in memory currently?

GET /api/v1/applications/[app-id]/storage/rdd/0

will respond with:

{
  "id" : 0,
  "name" : "ParallelCollectionRDD",
  "numPartitions" : 2,
  "numCachedPartitions" : 2,
  "storageLevel" : "Memory Deserialized 1x Replicated",
  "memoryUsed" : 28000032,
  "diskUsed" : 0,
  "dataDistribution" : [ {
    "address" : "localhost:54984",
    "memoryUsed" : 28000032,
    "memoryRemaining" : 527755733,
    "diskUsed" : 0
  } ],
  "partitions" : [ {
    "blockName" : "rdd_0_0",
    "storageLevel" : "Memory Deserialized 1x Replicated",
    "memoryUsed" : 14000016,
    "diskUsed" : 0,
    "executors" : [ "localhost:54984" ]
  }, {
    "blockName" : "rdd_0_1",
    "storageLevel" : "Memory Deserialized 1x Replicated",
    "memoryUsed" : 14000016,
    "diskUsed" : 0,
    "executors" : [ "localhost:54984" ]
  } ]
}
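As a minimal sketch, the in-memory percentage can be derived from that response by comparing numCachedPartitions to numPartitions. The snippet below hard-codes a trimmed copy of the sample response above instead of calling a live server:

```python
import json

# Trimmed copy of the sample GET /api/v1/applications/[app-id]/storage/rdd/0
# response shown above; in practice this would come from the REST API.
response = json.loads("""{
  "id": 0,
  "numPartitions": 2,
  "numCachedPartitions": 2,
  "memoryUsed": 28000032,
  "diskUsed": 0
}""")

# Percentage of the RDD's partitions currently cached in memory.
pct_in_memory = 100.0 * response["numCachedPartitions"] / response["numPartitions"]
print(pct_in_memory)  # 100.0 for the sample response
```

The memoryUsed and diskUsed fields give the byte counts if you want sizes rather than a partition percentage.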

The overall time taken to compute an RDD?

The work to compute an RDD is tracked as a job, its stages, and their attempts. GET /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskSummary

will respond with:

{
  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
  "executorDeserializeTime" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ],
  "executorRunTime" : [ 3.0, 3.0, 4.0, 4.0, 4.0 ],
  "resultSize" : [ 1457.0, 1457.0, 1457.0, 1457.0, 1457.0 ],
  "jvmGcTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "resultSerializationTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "memoryBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "diskBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "shuffleReadMetrics" : {
    "readBytes" : [ 340.0, 340.0, 342.0, 342.0, 342.0 ],
    "readRecords" : [ 10.0, 10.0, 10.0, 10.0, 10.0 ],
    "remoteBlocksFetched" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
    "localBlocksFetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ],
    "fetchWaitTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
    "remoteBytesRead" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
    "totalBlocksFetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ]
  }
}
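Each metric array lines up positionally with the quantiles array, so pulling out, say, the median executor run time is just an index lookup. A small sketch, again using a trimmed copy of the sample response above rather than a live call:

```python
import json

# Trimmed copy of the sample taskSummary response shown above.
summary = json.loads("""{
  "quantiles": [0.05, 0.25, 0.5, 0.75, 0.95],
  "executorRunTime": [3.0, 3.0, 4.0, 4.0, 4.0],
  "jvmGcTime": [0.0, 0.0, 0.0, 0.0, 0.0]
}""")

# The metric arrays are parallel to the quantiles array:
# find where quantile 0.5 sits, then read the run time at that position.
median_idx = summary["quantiles"].index(0.5)
median_run_time = summary["executorRunTime"][median_idx]
print(median_run_time)  # 4.0 in the sample
```

The same indexing works for any of the per-task metrics (jvmGcTime, resultSerializationTime, the shuffleReadMetrics arrays, and so on), which covers the "log these metrics programmatically" part of the question.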

Your question is broad, so I will not respond to all of it. I believe everything Spark tracks internally is reflected in the REST API.
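Putting it together, here is a minimal Python sketch for querying the API programmatically. It assumes the default driver UI address (localhost:4040); the app-id in the comment is a placeholder - substitute your own application's id and port:

```python
import json
import urllib.request

# Assumed default address of the Spark driver UI; adjust port/host as needed.
BASE_URL = "http://localhost:4040/api/v1"

def rdd_storage_url(app_id, rdd_id):
    """Build the URL for the per-RDD storage endpoint."""
    return f"{BASE_URL}/applications/{app_id}/storage/rdd/{rdd_id}"

def fetch_json(url):
    """GET a REST API endpoint and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Example (requires a running Spark application; app id is a placeholder):
#   storage = fetch_json(rdd_storage_url("app-20150101000000-0000", 0))
#   print(storage["memoryUsed"])
```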

