This post is also available in: en
A few months ago, a friend ask me how to solve a problem about some calculations he was doing on a large matrix (on R).
Basically, he needed to compute a division operation by row, inside different groups: divide the first row by the second, the second for the third and so on, inside every group.
He needed speed, so even though it wasn't a "big data" issue or complex calculation, his actual solution wasn't fast enough (and also he was having some problems putting all the ideas together).
As always, when it comes to speed, nothing better that data.table
. First of all, some dummy data:
library(data.table)
set.seed(2389)
dummy <- data.table(matrix(runif(4*1000, 1, 10), ncol=4)) # random numeric data
dummy[, categoria:=rep(letters[1:10], each=100)] # groups (not random at all)
If you take a look to the dummy table, you'll see 4 numeric variables and one categorical, indicating the group
V1 V2 V3 V4 categoria
1: 2.058503 5.732043 7.341794 9.094358 a
2: 5.444102 6.628791 1.275599 9.518217 a
3: 5.376183 5.890163 2.244571 8.535461 a
4: 2.159202 9.335083 2.545126 9.363613 a
5: 4.348230 7.583110 3.694249 9.729261 a
---
996: 2.457340 3.847248 1.881991 4.937060 j
997: 6.142252 6.006729 5.023275 8.385980 j
998: 2.292033 9.564030 9.640693 6.288872 j
999: 1.247380 5.198429 9.424946 2.117781 j
1000: 9.826283 8.066924 8.141570 4.098256 j
And in one single line, you can compute what he needed:
dummy[, lapply(.SD, function(x) x/shift(x)), by=categoria]
categoria V1 V2 V3 V4
1: a NA NA NA NA
2: a 2.6446902 1.1564447 0.1737449 1.0466068
3: a 0.9875244 0.8885728 1.7596210 0.8967500
4: a 0.4016236 1.5848598 1.1339032 1.0970249
5: a 2.0138133 0.8123238 1.4514993 1.0390499
---
996: j 0.4927712 0.9087373 0.3187834 2.2825036
997: j 2.4995538 1.5613054 2.6691281 1.6985778
998: j 0.3731583 1.5922195 1.9192048 0.7499269
999: j 0.5442243 0.5435396 0.9776211 0.3367505
1000: j 7.8775396 1.5518003 0.8638321 1.9351651
And you can do the same for every other kind of lagged operation and you can use a bigger offset (1 to infinity-1) and move forward, backward. As always, you can get more info with ?data.table::shift
.
Thanks to data.table
for existing...