R: Fast grouping

What: Fast calculation on R data structures
Why: Performance
How: Use data tables instead of data frames

A few days ago I struggled with performance issues in R. The task was the analysis of a larger text corpus (~600 MB of English text). The preprocessing was eventually done in Java, which made the difference between unusable (R) and quite fast (Java). The analysis itself was possible in R after switching from data frames to data tables.

The following snippet shows an example of calculating the mean on subgroups of the data:

library(data.table)

# generate 10 million random values and a grouping factor with 5 levels
n <- 10000000
test <- rnorm(n)
fac <- sample(1:5, n, replace = TRUE)

df <- data.frame(col1 = test, fac = fac)
dt <- data.table(col1 = test, fac = fac)

# group-wise mean using a data frame and aggregate()
f_df <- function() {
  aggregate(col1 ~ fac, df, mean)
}
f_df()
system.time(f_df())

# group-wise mean using a data table
f_dt <- function() {
  dt[, mean(col1), by = fac]
}
f_dt()
system.time(f_dt())

The results (system.time output, translated from a German locale) are:

Data structure   user    system   elapsed [s]
Data frame       8.13    0.97     9.09
Data table       0.12    0.01     0.14

Essentially, this is a speed-up of roughly a factor of 65 in elapsed time on my machine (Win 8.1, 8 GB RAM, i5).
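A single system.time() call is a rather coarse measurement. As a side note (not part of the original comparison), the two functions could also be timed repeatedly, for example with the microbenchmark package, to get a distribution of run times:

library(microbenchmark)

# repeat both grouping functions and compare the timing distributions
microbenchmark(
  data.frame = f_df(),
  data.table = f_dt(),
  times = 10
)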

Where to go next: Read this interesting blog post regarding data tables and performance. Also have a look at the fread function for reading in data, as well as other nice features of data tables such as keys and assignment of columns by reference; a short sketch follows below.
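As a rough, hypothetical illustration of those features (the file name corpus.csv and the column names are made up and not from this post), reading and manipulating a data table could look like this:

library(data.table)

# fread() is a fast alternative to read.csv(); file and columns are placeholders
dt <- fread("corpus.csv")                  # assumed columns: doc_id, fac, col1

# a key sorts the table and allows fast binary-search subsetting
setkey(dt, fac)
dt[.(3)]                                   # all rows with fac == 3

# := adds or changes a column by reference, without copying the whole table
dt[, col1_scaled := (col1 - mean(col1)) / sd(col1)]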