Apache Spark - accessing internal data on RDDs?
I started doing the AMP Camp 5 exercises. I tried the following two scenarios:
Scenario #1

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count
Scenario #2

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count
The total time shown in the Spark shell Application UI differs between the two scenarios: Scenario #1 took 0.5 seconds, while Scenario #2 took only 0.2 seconds.
In Scenario #1, the checkpoint command does nothing on its own; it is neither a transformation nor an action. It just says that once the RDD materializes after an action, go ahead and save it to disk. Am I missing something here?
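A minimal sketch that makes this laziness visible in the shell (the checkpoint directory path here is illustrative, and checkpoint requires one to be set first):

sc.setCheckpointDir("/tmp/checkpoints")  // required before checkpoint has any effect
val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint()
println(pagecounts.isCheckpointed)  // false: nothing has been written yet
pagecounts.count()                  // the first action materializes and checkpoints the RDD
println(pagecounts.isCheckpointed)  // true: the RDD is now saved to the checkpoint dir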
Questions:

I understand that Scenario #1 takes more time because the RDD is checkpointed (written to disk). Is there a way to know how much of the total time was taken by the checkpoint?

The Spark shell Application UI shows the following: scheduler delay, task deserialization time, GC time, result serialization time, and getting result time. But it doesn't show a breakdown for checkpointing. Is there a way to access those metrics, e.g. scheduler delay and GC time, and save them programmatically? I want to log some of the above metrics for every action invoked on an RDD.
How can I programmatically access the following information:

- The size of the RDD, when persisted to disk on checkpointing?
- What percentage of the RDD is currently in memory?
- The overall time taken to compute the RDD?
Please let me know if you need more information.
The Spark REST API provides what you are asking for. Some examples:
What percentage of the RDD is currently in memory?
GET /api/v1/applications/[app-id]/storage/rdd/0
will respond with:
{ "id" : 0, "name" : "parallelcollectionrdd", "numpartitions" : 2, "numcachedpartitions" : 2, "storagelevel" : "memory deserialized 1x replicated", "memoryused" : 28000032, "diskused" : 0, "datadistribution" : [ { "address" : "localhost:54984", "memoryused" : 28000032, "memoryremaining" : 527755733, "diskused" : 0 } ], "partitions" : [ { "blockname" : "rdd_0_0", "storagelevel" : "memory deserialized 1x replicated", "memoryused" : 14000016, "diskused" : 0, "executors" : [ "localhost:54984" ] }, { "blockname" : "rdd_0_1", "storagelevel" : "memory deserialized 1x replicated", "memoryused" : 14000016, "diskused" : 0, "executors" : [ "localhost:54984" ] } ] }
Overall time taken to compute an RDD?
Computing an RDD is tracked as a job, which breaks down into stages and their attempts. For example:

GET /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskSummary
will respond with:
{ "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ], "executordeserializetime" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ], "executorruntime" : [ 3.0, 3.0, 4.0, 4.0, 4.0 ], "resultsize" : [ 1457.0, 1457.0, 1457.0, 1457.0, 1457.0 ], "jvmgctime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ], "resultserializationtime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ], "memorybytesspilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ], "diskbytesspilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ], "shufflereadmetrics" : { "readbytes" : [ 340.0, 340.0, 342.0, 342.0, 342.0 ], "readrecords" : [ 10.0, 10.0, 10.0, 10.0, 10.0 ], "remoteblocksfetched" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ], "localblocksfetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ], "fetchwaittime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ], "remotebytesread" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ], "totalblocksfetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ] } }
Your question is quite broad, hence I will not respond to all of it. I believe everything Spark has to reflect is reflected in the REST API.