I am trying to optimize a snippet of R code that calculates lagged differences using data.table. I have 2 working solutions, but both run painfully slowly on my real data (500-million-row datasets). I have enjoyed the speedup and efficiency of data.table in general, but both solutions I implemented are quite slow compared to other data.table operations.
Could someone suggest a more efficient coding practice in data.table for this specific task?
    library(data.table)
    set.seed(1)
    id <- 1:10
    date_samp <- seq.Date(as.Date("2010-01-01"), as.Date("2011-01-01"), "days")
    dt1 <- data.table(id = sample(id, size = 30, replace = TRUE),
                      date_1 = sample(date_samp, size = 30, replace = TRUE))
    setkey(dt1, id, date_1)

    ### attempt at lagged date
    ## attempt 1
    dt1[, date_diff := c(0, diff(date_1)), by = id]

    ## attempt 2
    ## works but gives warnings
    dt1[, date_diff := NULL]
    dt1[, n_group := .N, by = id]
    dt1[, date_diff := c(0, date_1[2:n_group] - date_1[1:(n_group - 1)]), by = id]
After a bit more effort I found the "shift()" function in a related question. I have made the data a bit larger, done some crude profiling, and added a few more approaches... but please update and provide a different answer if there is a more efficient approach.
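For anyone unfamiliar with it, here is a minimal sketch of what "shift()" does with its default arguments. The `toy` table and its values are made up purely for illustration. With the defaults it lags by one position and fills the first element with NA, so `x - shift(x)` gives a lagged difference with a leading NA in each group rather than a leading 0.

```r
library(data.table)

# Toy data, invented for illustration only
toy <- data.table(id = c(1L, 1L, 1L, 2L, 2L),
                  x  = c(1L, 3L, 6L, 10L, 15L))

# shift() defaults: n = 1, type = "lag", fill = NA
lagged <- shift(toy$x)

# Lagged difference within each id; the first row of every group is NA
toy[, x_diff := x - shift(x), by = id]
print(toy)
```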
In response to the comments below I have added and changed a few things... one attempt was numeric (not integer), and the keyed version was incorrect. I have added an integer comparison and a keyed integer version (in addition to the numeric ones). It looks like converting the date to integer and using "grouping by each i" is the fastest solution.
    library(data.table)
    set.seed(1)
    id <- 1:100
    date_samp <- seq.Date(as.Date("2010-01-01"), as.Date("2011-01-01"), "days")
    n_samp <- 1e7
    dt1 <- data.table(id = sample(id, size = n_samp, replace = TRUE),
                      date_1 = sample(date_samp, size = n_samp, replace = TRUE))
    setkey(dt1, id, date_1)

    ### attempt at lagged date
    ## attempt 1
    dt1[, date_diff := NULL]
    system.time(dt1[, date_diff := c(0, diff(date_1)), by = id])

    ## attempt 2
    dt1[, date_diff := NULL]
    dt1[, n_group := .N, by = id]
    system.time(dt1[, date_diff := c(0, date_1[2:n_group] - date_1[1:(n_group - 1)]), by = id])

    ## attempt 3
    dt1[, date_diff := NULL]
    system.time(dt1[, date_diff := date_1 - shift(date_1), by = id])

    ## attempt 4
    ## use numeric instead
    dt1[, date_diff := NULL]
    dt1[, date_1num := NULL]
    dt1[, date_1num := as.numeric(date_1)]
    system.time(dt1[, date_diff := date_1num - shift(date_1num), by = id])

    ## attempt 5
    ## use keyed
    dt_key <- unique(dt1[, list(id)])
    dt1[, date_diff := NULL]
    system.time(dt1[dt_key, date_diff := date_1num - shift(date_1num), by = .EACHI])

    ## attempt 6
    ## use integers instead
    dt1[, date_diff := NULL]
    dt1[, date_1int := as.integer(date_1)]
    system.time(dt1[, date_diff := date_1int - shift(date_1int), by = id])

    ## attempt 7
    ## use integers keyed
    dt1[, date_diff := NULL]
    dt1[, date_1int := as.integer(date_1)]
    system.time(dt1[dt_key, date_diff := date_1int - shift(date_1int), by = .EACHI])

    # attempt  user  system  elapsed
    # 1        0.34  0.25    0.59
    # 2        0.37  0.28    0.67
    # 3        0.25  0.16    0.41
    # 4        0.11  0.01    0.13
    # 5        0.06  0.03    0.10
    # 6        0.09  0.00    0.09
    # 7        0.05  0.00    0.04
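As a sanity check on the timings, a small sketch along these lines can confirm that the approaches agree apart from the first row of each group (0 for the `diff()` version, NA for the `shift()` versions). The `chk` table, its size, and the `na_to_zero` helper are arbitrary names and choices made for illustration; they are not part of the benchmark above.

```r
library(data.table)
set.seed(1)

date_samp <- seq.Date(as.Date("2010-01-01"), as.Date("2011-01-01"), "days")
chk <- data.table(id     = sample(1:10, size = 1000, replace = TRUE),
                  date_1 = sample(date_samp, size = 1000, replace = TRUE))
setkey(chk, id, date_1)
chk[, date_1int := as.integer(date_1)]

# diff()-based version: first row of each group is 0
chk[, d_diff := c(0, diff(date_1int)), by = id]

# shift()-based version grouped with by = id: first row of each group is NA
chk[, d_shift := date_1int - shift(date_1int), by = id]

# shift()-based version joined on the unique ids and grouped with by = .EACHI
chk_key <- unique(chk[, list(id)])
chk[chk_key, d_eachi := date_1int - shift(date_1int), by = .EACHI]

# Hypothetical helper: replace the leading NA with 0 so results are comparable
na_to_zero <- function(v) {
  v[is.na(v)] <- 0L
  v
}

same_shift <- all(chk$d_diff == na_to_zero(chk$d_shift))
same_eachi <- all(chk$d_diff == na_to_zero(chk$d_eachi))
print(c(same_shift, same_eachi))
```

If both values print as TRUE, the faster integer and `.EACHI` variants compute the same lagged differences as the original `diff()` approach, so the choice between them comes down to speed and whether a leading NA or 0 is wanted.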