Java - Aparapi GPU execution slower than CPU
I am trying to test the performance of Aparapi. I have seen some blogs where the results show that Aparapi can improve performance for data-parallel operations.
But I am not able to see that in my tests. Here is what I did: I wrote two programs, one using Aparapi, the other one using normal loops.
Program 1: In Aparapi

    import com.amd.aparapi.Kernel;
    import com.amd.aparapi.Range;

    public class App {
        public static void main(String[] args) {
            final int size = 50000000;
            final float[] a = new float[size];
            final float[] b = new float[size];
            for (int i = 0; i < size; i++) {
                a[i] = (float) (Math.random() * 100);
                b[i] = (float) (Math.random() * 100);
            }
            final float[] sum = new float[size];

            // Each work item adds one pair of elements.
            Kernel kernel = new Kernel() {
                @Override
                public void run() {
                    int gid = getGlobalId();
                    sum[gid] = a[gid] + b[gid];
                }
            };

            long t1 = System.currentTimeMillis();
            kernel.execute(Range.create(size));
            long t2 = System.currentTimeMillis();
            System.out.println("Execution mode = " + kernel.getExecutionMode());
            kernel.dispose();
            System.out.println(t2 - t1);
        }
    }
Program 2: Using loops

    public class App2 {
        public static void main(String[] args) {
            final int size = 50000000;
            final float[] a = new float[size];
            final float[] b = new float[size];
            for (int i = 0; i < size; i++) {
                a[i] = (float) (Math.random() * 100);
                b[i] = (float) (Math.random() * 100);
            }
            final float[] sum = new float[size];

            long t1 = System.currentTimeMillis();
            for (int i = 0; i < size; i++) {
                sum[i] = a[i] + b[i];
            }
            long t2 = System.currentTimeMillis();
            System.out.println(t2 - t1);
        }
    }
Program 1 takes around 330 ms whereas Program 2 takes only around 55 ms. Am I doing something wrong here? I did print out the execution mode in the Aparapi program, and it prints that the mode of execution is GPU.
You did not do anything wrong - except for the benchmark itself.
Benchmarking is always tricky, particularly in cases where a JIT is involved (as in Java), and for libraries where many nitty-gritty details are hidden from the user (as in Aparapi). In both cases, you should at least execute the code section that you want to benchmark multiple times.
For the Java version, one might expect the computation time for a single execution of the loop to decrease when the loop is executed multiple times, due to the JIT kicking in. There are many additional caveats to consider - for details, you should refer to this answer. In this simple test, the effect of the JIT may not be noticeable, but in more realistic or complex scenarios, it will make a difference. Anyhow: when repeating the loop 10 times, the time for a single execution of the loop on my machine was about 70 milliseconds.
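The advice to run the benchmarked section several times can be sketched in plain Java as follows. This is only a minimal illustration, not a rigorous benchmark harness: the class name `RepeatedTiming` and the helper `timeRuns` are made up for this example, and the array size is reduced from the question's 50000000 so that the sketch stays light.

```java
import java.util.Arrays;

class RepeatedTiming {

    /** Runs the task `runs` times and returns the wall-clock time of each run in milliseconds. */
    static long[] timeRuns(Runnable task, int runs) {
        long[] millis = new long[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            task.run();
            millis[r] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Assumption: a smaller size than in the question, to keep memory use modest.
        final int size = 5_000_000;
        final float[] a = new float[size];
        final float[] b = new float[size];
        final float[] sum = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }

        // Time the same loop ten times; the first runs may be slower until the JIT has compiled it.
        long[] timings = timeRuns(() -> {
            for (int i = 0; i < size; i++) {
                sum[i] = a[i] + b[i];
            }
        }, 10);
        System.out.println(Arrays.toString(timings));
    }
}
```

Printing every run separately, instead of one total, makes one-off costs such as JIT compilation or a library's lazy initialization show up as outliers in the first entries.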
For the Aparapi version, the point of possible GPU initialization was already mentioned in the comments. And here, this is indeed the main problem: when running the kernel 10 times, the timings (in milliseconds) on my machine are
1248 72 72 72 73 71 72 73 72 72
You see that the initial call causes all the overhead. The reason is that, during the first call to Kernel#execute(), it has to do all the initialization (basically converting the bytecode to OpenCL, compiling the OpenCL code etc.). This is also mentioned in the documentation of the KernelRunner class:

"The KernelRunner is created lazily as a result of calling Kernel.execute()."
The effect of this - namely, a comparatively large delay for the first execution - has led to this question on the Aparapi mailing list: A way to eagerly create KernelRunners. The only workaround suggested there was to create an "initialization call" like

    kernel.execute(Range.create(1));

without a real workload, only to trigger the whole setup, so that the subsequent calls are fast. (This also works for your example.)
You may have noticed that, even after the initialization, the Aparapi version is still not faster than the plain Java version. The reason is that a simple vector addition like this is memory bound - for details, you may refer to this answer, which explains this term and some issues with GPU programming in general.
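To make "memory bound" a bit more concrete, here is a small back-of-the-envelope calculation for the vector addition from the question. It only counts the bytes the loop has to touch per addition; the class and method names are invented for this sketch, and it does not model any particular CPU, GPU, or bus.

```java
class VectorAddTraffic {

    /** Bytes read and written by sum[i] = a[i] + b[i] over `size` float elements. */
    static long bytesMoved(long size) {
        final int arrays = 3; // a and b are read, sum is written
        return size * arrays * Float.BYTES;
    }

    /** Bytes of memory traffic per floating-point addition. */
    static double bytesPerOperation(long size) {
        return (double) bytesMoved(size) / size; // exactly one addition per element
    }

    public static void main(String[] args) {
        long size = 50_000_000L; // the array size used in the question
        System.out.println(bytesMoved(size) + " bytes moved");              // 600000000 bytes moved
        System.out.println(bytesPerOperation(size) + " bytes per addition"); // 12.0 bytes per addition
    }
}
```

So each single addition comes with 12 bytes of memory traffic, which means the runtime is dominated by moving data rather than by arithmetic. On a discrete GPU the arrays additionally have to reach device memory and come back, so extra compute power alone cannot help much here.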
As an overly suggestive example of a case that might benefit from the GPU, you might want to modify your test in order to create an artificial compute bound task: when you change the kernel to involve some expensive trigonometric functions, like this

    Kernel kernel = new Kernel() {
        @Override
        public void run() {
            int gid = getGlobalId();
            sum[gid] = (float) (Math.cos(Math.sin(a[gid])) + Math.sin(Math.cos(b[gid])));
        }
    };

and the plain Java loop version accordingly, like this

    for (int i = 0; i < size; i++) {
        sum[i] = (float) (Math.cos(Math.sin(a[i])) + Math.sin(Math.cos(b[i])));
    }
then you will see a difference. On my machine (GeForce 970 GPU vs. AMD K10 CPU) the timings are about 140 milliseconds for the Aparapi version, and a whopping 12000 milliseconds for the plain Java version - that's a speedup of roughly 90 through Aparapi!
Also note that even in CPU mode, Aparapi may offer an advantage compared to plain Java. On my machine, in CPU mode, Aparapi needs only 2300 milliseconds, because it still parallelizes the execution using a Java thread pool.
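For comparison, the same compute bound loop can also be spread over several CPU threads without Aparapi. The following sketch uses a parallel `IntStream` from the standard library; this is not how Aparapi implements its CPU mode, it merely illustrates why a multi-threaded run can beat the single-threaded loop. The class and method names are hypothetical.

```java
import java.util.stream.IntStream;

class ParallelCpuSketch {

    /** The artificial compute bound operation from the modified test. */
    static float expensive(float x, float y) {
        return (float) (Math.cos(Math.sin(x)) + Math.sin(Math.cos(y)));
    }

    /** Single-threaded version, equivalent to the plain Java loop. */
    static float[] sequential(float[] a, float[] b) {
        float[] sum = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = expensive(a[i], b[i]);
        }
        return sum;
    }

    /** Multi-threaded version; every index is written by exactly one thread. */
    static float[] parallel(float[] a, float[] b) {
        float[] sum = new float[a.length];
        IntStream.range(0, a.length)
                 .parallel()
                 .forEach(i -> sum[i] = expensive(a[i], b[i]));
        return sum;
    }

    public static void main(String[] args) {
        // Assumption: a small size for illustration, not the 50000000 from the question.
        final int size = 1_000_000;
        float[] a = new float[size];
        float[] b = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }
        float[] seq = sequential(a, b);
        float[] par = parallel(a, b);
        System.out.println("first element, sequential: " + seq[0] + ", parallel: " + par[0]);
    }
}
```

Because the elements are independent of each other, the work splits cleanly across threads. How much faster the parallel version is depends on the number of cores and should be measured with repeated runs, as discussed above.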