How to optimize lagged differences in data.table (R)

I am trying to optimize a snippet of R code that calculates lagged differences using data.table. I have two working solutions, but both run painfully slowly on my real data (datasets of 500 million rows). I have enjoyed the speedup and efficiency of data.table in general, but both of the solutions I implemented are quite slow compared to other data.table operations.

Could someone suggest a more efficient coding practice in data.table for this specific task?

    library(data.table)
    set.seed(1)
    id <- 1:10
    date_samp <- seq.Date(as.Date("2010-01-01"), as.Date("2011-01-01"), "days")
    dt1 <- data.table(id = sample(id, size = 30, replace = TRUE),
                      date_1 = sample(date_samp, size = 30, replace = TRUE))
    setkey(dt1, id, date_1)

    ### attempt lagged date
    ## attempt 1
    dt1[, date_diff := c(0, diff(date_1)), by = id]

    ## attempt 2
    ## works but gives warnings
    dt1[, date_diff := NULL]
    dt1[, n_group := .N, by = id]
    dt1[, date_diff := c(0, date_1[2:n_group] - date_1[1:(n_group - 1)]),
        by = id]

After a bit more effort I found the "shift()" function in a related question. I have made the data a bit larger, done some crude profiling, and added a few more approaches... but please update and provide a different answer if there is a more efficient approach.

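In response to the comments below I have added and changed a few things: the earlier attempt used numeric (not integer) values, and the keyed version was keyed incorrectly. I have added an integer comparison and a keyed integer version (in addition to the numeric ones). It looks like converting the date to an integer and using "grouping by each i" (by = .EACHI) is the fastest solution.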

    library(data.table)
    set.seed(1)
    id <- 1:100
    date_samp <- seq.Date(as.Date("2010-01-01"), as.Date("2011-01-01"), "days")
    n_samp <- 1e7
    dt1 <- data.table(id = sample(id, size = n_samp, replace = TRUE),
                      date_1 = sample(date_samp, size = n_samp, replace = TRUE))
    setkey(dt1, id, date_1)

    ### attempt lagged date
    ## attempt 1
    system.time(dt1[, date_diff := c(0, diff(date_1)), by = id])

    ## attempt 2
    dt1[, date_diff := NULL]
    dt1[, n_group := .N, by = id]
    system.time(dt1[, date_diff := c(0, date_1[2:n_group] - date_1[1:(n_group - 1)]),
                    by = id])

    ## attempt 3
    dt1[, date_diff := NULL]
    system.time(dt1[, date_diff := date_1 - shift(date_1), by = id])

    ## attempt 4
    ## use numeric instead
    dt1[, date_diff := NULL]
    dt1[, date_1num := as.numeric(date_1)]
    system.time(dt1[, date_diff := date_1num - shift(date_1num), by = id])

    ## attempt 5
    ## use keyed
    dt_key <- unique(dt1[, list(id)])
    dt1[, date_diff := NULL]
    system.time(dt1[dt_key,
                    date_diff := date_1num - shift(date_1num),
                    by = .EACHI])

    ## attempt 6
    ## use integers instead
    dt1[, date_diff := NULL]
    dt1[, date_1int := as.integer(date_1)]
    system.time(dt1[, date_diff := date_1int - shift(date_1int), by = id])

    ## attempt 7
    ## use integers, keyed
    dt1[, date_diff := NULL]
    dt1[, date_1int := as.integer(date_1)]
    system.time(dt1[dt_key,
                    date_diff := date_1int - shift(date_1int),
                    by = .EACHI])

    # attempt   user  system elapsed
    # 1         0.34    0.25    0.59
    # 2         0.37    0.28    0.67
    # 3         0.25    0.16    0.41
    # 4         0.11    0.01    0.13
    # 5         0.06    0.03    0.10
    # 6         0.09    0.00    0.09
    # 7         0.05    0.00    0.04
