Apache Spark - accessing internal data on RDDs?


I started doing the AMP Camp 5 exercises. I tried the following two scenarios:

Scenario #1

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.checkpoint
pagecounts.count

Scenario #2

val pagecounts = sc.textFile("data/pagecounts")
pagecounts.count

The total time shown in the Spark shell Application UI differs between the two scenarios:
scenario #1 took 0.5 seconds, while scenario #2 took only 0.2 s.

In scenario #1, the checkpoint command by itself does nothing; it is neither a transformation nor an action. It just says that once the RDD materializes after an action, it should be saved to disk. Am I missing something here?

Questions:

  1. I understand why scenario #1 takes more time, because the RDD is check-pointed (written to disk). Is there a way I can find out the time taken by checkpointing, out of the total time?
    The Spark shell Application UI shows the following - Scheduler Delay, Task Deserialization Time, GC Time, Result Serialization Time, Getting Result Time - but it doesn't show a breakdown for checkpointing.

  2. Is there a way to access the above metrics, e.g. Scheduler Delay and GC Time, and save them programmatically? I want to log some of the above metrics for every action invoked on an RDD.

  3. How can I programmatically access the following information:

    • the size of an RDD, when persisted to disk by checkpointing?
    • what percentage of an RDD is in memory currently?
    • the overall time taken to compute an RDD?

Please let me know if you need more information.

The Spark REST API provides what you are asking for.

Some examples:

What percentage of an RDD is in memory currently?

GET /api/v1/applications/[app-id]/storage/rdd/0

will respond with:

{
  "id" : 0,
  "name" : "ParallelCollectionRDD",
  "numPartitions" : 2,
  "numCachedPartitions" : 2,
  "storageLevel" : "Memory Deserialized 1x Replicated",
  "memoryUsed" : 28000032,
  "diskUsed" : 0,
  "dataDistribution" : [ {
    "address" : "localhost:54984",
    "memoryUsed" : 28000032,
    "memoryRemaining" : 527755733,
    "diskUsed" : 0
  } ],
  "partitions" : [ {
    "blockName" : "rdd_0_0",
    "storageLevel" : "Memory Deserialized 1x Replicated",
    "memoryUsed" : 14000016,
    "diskUsed" : 0,
    "executors" : [ "localhost:54984" ]
  }, {
    "blockName" : "rdd_0_1",
    "storageLevel" : "Memory Deserialized 1x Replicated",
    "memoryUsed" : 14000016,
    "diskUsed" : 0,
    "executors" : [ "localhost:54984" ]
  } ]
}
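As a minimal sketch, the in-memory percentage can be derived from that response by comparing numCachedPartitions to numPartitions. The snippet below hard-codes a trimmed copy of the sample response above instead of calling a live server:

```python
import json

# Trimmed copy of the sample GET /api/v1/applications/[app-id]/storage/rdd/0
# response shown above; in practice this would come from the REST API.
response = json.loads("""{
  "id": 0,
  "numPartitions": 2,
  "numCachedPartitions": 2,
  "memoryUsed": 28000032,
  "diskUsed": 0
}""")

# Percentage of the RDD's partitions currently cached in memory.
pct_in_memory = 100.0 * response["numCachedPartitions"] / response["numPartitions"]
print(pct_in_memory)  # 100.0 for the sample response
```

The memoryUsed and diskUsed fields give the byte counts if you want sizes rather than a partition percentage.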

The overall time taken to compute an RDD?

The work to compute an RDD is tracked as a job, its stages, and their attempts. GET /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskSummary

will respond with:

{
  "quantiles" : [ 0.05, 0.25, 0.5, 0.75, 0.95 ],
  "executorDeserializeTime" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ],
  "executorRunTime" : [ 3.0, 3.0, 4.0, 4.0, 4.0 ],
  "resultSize" : [ 1457.0, 1457.0, 1457.0, 1457.0, 1457.0 ],
  "jvmGcTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "resultSerializationTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "memoryBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "diskBytesSpilled" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
  "shuffleReadMetrics" : {
    "readBytes" : [ 340.0, 340.0, 342.0, 342.0, 342.0 ],
    "readRecords" : [ 10.0, 10.0, 10.0, 10.0, 10.0 ],
    "remoteBlocksFetched" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
    "localBlocksFetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ],
    "fetchWaitTime" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
    "remoteBytesRead" : [ 0.0, 0.0, 0.0, 0.0, 0.0 ],
    "totalBlocksFetched" : [ 2.0, 2.0, 2.0, 2.0, 2.0 ]
  }
}
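Each metric array lines up positionally with the quantiles array, so pulling out, say, the median executor run time is just an index lookup. A small sketch, again using a trimmed copy of the sample response above rather than a live call:

```python
import json

# Trimmed copy of the sample taskSummary response shown above.
summary = json.loads("""{
  "quantiles": [0.05, 0.25, 0.5, 0.75, 0.95],
  "executorRunTime": [3.0, 3.0, 4.0, 4.0, 4.0],
  "jvmGcTime": [0.0, 0.0, 0.0, 0.0, 0.0]
}""")

# The metric arrays are parallel to the quantiles array:
# find where quantile 0.5 sits, then read the run time at that position.
median_idx = summary["quantiles"].index(0.5)
median_run_time = summary["executorRunTime"][median_idx]
print(median_run_time)  # 4.0 in the sample
```

The same indexing works for any of the per-task metrics (jvmGcTime, resultSerializationTime, the shuffleReadMetrics arrays, and so on), which covers the "log these metrics programmatically" part of the question.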

Your question is broad, so I will not respond to all of it. I believe everything Spark tracks internally is reflected in the REST API.
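Putting it together, here is a minimal Python sketch for querying the API programmatically. It assumes the default driver UI address (localhost:4040); the app-id in the comment is a placeholder - substitute your own application's id and port:

```python
import json
import urllib.request

# Assumed default address of the Spark driver UI; adjust port/host as needed.
BASE_URL = "http://localhost:4040/api/v1"

def rdd_storage_url(app_id, rdd_id):
    """Build the URL for the per-RDD storage endpoint."""
    return f"{BASE_URL}/applications/{app_id}/storage/rdd/{rdd_id}"

def fetch_json(url):
    """GET a REST API endpoint and decode the JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Example (requires a running Spark application; app id is a placeholder):
#   storage = fetch_json(rdd_storage_url("app-20150101000000-0000", 0))
#   print(storage["memoryUsed"])
```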

