Collecting row stats from distributed matrix in Spark (Java)
I am trying to multiply two large matrices and then find the indexes of the biggest 20 or so elements in each row of the resulting large (50000x50000) matrix. I am hoping to use Spark with Java for this. I found that distributed matrices can be multiplied if they are stored as BlockMatrices, but otherwise there do not seem to be more complex operations available for distributed matrices, and I am stuck. What is the best way to perform such an operation? The simple code I have so far looks like this:
JavaSparkContext sc = new JavaSparkContext(conf);
BlockMatrix a = getBlockMatrixA(sc);
BlockMatrix b = getBlockMatrixB(sc);
BlockMatrix ab = a.multiply(b);
Something like this should do the trick, though I have not tested it yet:
RDD<IndexedRow> byRow = ab.toIndexedRowMatrix().rows();
// wrapRDD is an instance method, so create an empty dummy JavaRDD
// just to wrap the Scala RDD as a JavaRDD
JavaRDD<IndexedRow> dummyRdd = sc.parallelize(new ArrayList<IndexedRow>());
JavaRDD<IndexedRow> javaByRow = dummyRdd.wrapRDD(byRow);
JavaPairRDD<Long, List<Long>> top20ByRow = javaByRow.mapToPair(
    new PairFunction<IndexedRow, Long, List<Long>>() {
        public Tuple2<Long, List<Long>> call(IndexedRow ir) throws Exception {
            return new Tuple2<Long, List<Long>>(ir.index(), getTopN(ir.vector(), n));
        }
    });
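The snippet above relies on a getTopN helper that is not shown. A minimal sketch of what it could look like, written over a plain double[] for simplicity (an IndexedRow's vector can be converted with ir.vector().toArray()); the class name TopN and the exact signature are assumptions, not part of any Spark API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopN {
    // Returns the indexes of the n largest values, largest first.
    // If the array has fewer than n elements, all indexes are returned.
    public static List<Long> getTopN(double[] values, int n) {
        // Min-heap of indexes ordered by their values: once the heap holds
        // n candidates, each offer/poll pair evicts the smallest survivor,
        // so only the n largest indexes remain after the scan.
        PriorityQueue<Integer> heap = new PriorityQueue<>(
            Math.max(1, n), Comparator.comparingDouble(i -> values[i]));
        for (int i = 0; i < values.length; i++) {
            heap.offer(i);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest of the kept candidates
            }
        }
        List<Long> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add((long) (int) heap.poll());
        }
        Collections.reverse(result); // poll order is smallest-first
        return result;
    }
}
```

This keeps the per-row cost at O(len * log n) rather than sorting the whole row, which matters when each row has 50000 entries but only the top 20 are wanted.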