r - Speed up tapply with changing groups -
i writing function calculate difference in mean of 2 groups, groups changes each time, simple results, problem have rather large data set, speed key. "readable" version, using iris data example.
loopdif = function(nsim) { change = numeric(nsim) var = iris$sepal.length (i in 1:nsim){ randomspecies = sample(c("a","b"), length(var), replace=true) change[i] = diff(tapply(var, randomspecies, mean)) } return(change) } > system.time(loopdif(10000)) user system elapsed 2.06 0.00 2.06
the tried vectorise code:
slowdif <- function(nsim) { change = numeric(nsim) randomspecies = replicate(nsim,sample(c("a","b"), length(var), replace=true)) var = iris$sepal.length change = diff(unlist(lapply(split(randomspecies, col(randomspecies)), function(x) unlist(lapply(split(var, x), mean))))) return(change) } > system.time(slowdif(10000)) user system elapsed 1.42 0.00 1.42
it faster now, still not faster enough, hope make under 1 second, or 0.75 seconds. reason obsessed time because have deadline meet, current code isn't fast enough.
i tried profiling tells me unlist(lapply()) part bottleneck, have no idea how rewrite it.
i appreciate if provide me alternative, suggestions. thanks.
try this:
loopdif2 <- function(nsim) { change <- numeric(nsim) var <- iris$sepal.length nagroup<-rbinom(nsim,length(var),0.5) tot<-sum(var) (i in 1:nsim){ change[i]<-sum(var[sample(length(var),nagroup[i])]) } change/nagroup-(tot-change)/(length(var)-nagroup) }
in words: first extract number of elements of a
group, keeping b
group implicit. extract indices of a
group in each iteration. evaluate sum , divide number of elements mean. other sum total sum of variable less sum of a
group. mean of b
group evaluated.
performance on pc:
system.time(loopdif(10000)) # user system elapsed #3.855 0.004 3.867 system.time(loopdif2(10000)) # user system elapsed #0.139 0.000 0.139
Comments
Post a Comment