This post is also available in: en

A few months ago, a friend ask me how to solve a problem about some calculations he was doing on a large matrix (on R).

Basically, he needed to compute a division operation by row, inside different groups: divide the first row by the second, the second for the third and so on, inside every group.

He needed speed, so even though it wasn't a "big data" issue or complex calculation, his actual solution wasn't fast enough (and also he was having some problems putting all the ideas together).

As always, when it comes to speed, nothing better that data.table. First of all, some dummy data:

library(data.table)

set.seed(2389)
dummy <- data.table(matrix(runif(4*1000, 1, 10), ncol=4))   # random numeric data
dummy[, categoria:=rep(letters[1:10], each=100)]            # groups (not random at all)

If you take a look to the dummy table, you'll see 4 numeric variables and one categorical, indicating the group

            V1       V2       V3       V4 categoria
   1: 2.058503 5.732043 7.341794 9.094358         a
   2: 5.444102 6.628791 1.275599 9.518217         a
   3: 5.376183 5.890163 2.244571 8.535461         a
   4: 2.159202 9.335083 2.545126 9.363613         a
   5: 4.348230 7.583110 3.694249 9.729261         a
  ---                                              
 996: 2.457340 3.847248 1.881991 4.937060         j
 997: 6.142252 6.006729 5.023275 8.385980         j
 998: 2.292033 9.564030 9.640693 6.288872         j
 999: 1.247380 5.198429 9.424946 2.117781         j
1000: 9.826283 8.066924 8.141570 4.098256         j

And in one single line, you can compute what he needed:

dummy[, lapply(.SD, function(x) x/shift(x)), by=categoria]
      categoria        V1        V2        V3        V4
   1:         a        NA        NA        NA        NA
   2:         a 2.6446902 1.1564447 0.1737449 1.0466068
   3:         a 0.9875244 0.8885728 1.7596210 0.8967500
   4:         a 0.4016236 1.5848598 1.1339032 1.0970249
   5:         a 2.0138133 0.8123238 1.4514993 1.0390499
  ---                                                  
 996:         j 0.4927712 0.9087373 0.3187834 2.2825036
 997:         j 2.4995538 1.5613054 2.6691281 1.6985778
 998:         j 0.3731583 1.5922195 1.9192048 0.7499269
 999:         j 0.5442243 0.5435396 0.9776211 0.3367505
1000:         j 7.8775396 1.5518003 0.8638321 1.9351651

And you can do the same for every other kind of lagged operation and you can use a bigger offset (1 to infinity-1) and move forward, backward. As always, you can get more info with ?data.table::shift.

Thanks to data.table for existing...


Published

Category

R

Tags

Contacto