What: Fast calculation on R data structures
Why: Performance
How: Use data tables instead of data frames
A few days ago I struggled with performance issues in R. The task was the analysis of a larger text corpus (~600 MB of English text). The preprocessing was finally done in Java, which made the difference between unusable (R) and quite fast (Java). The analysis itself was feasible in R after switching from data frames to data tables.
The following snippet shows an example of calculating the mean on subgroups of the data:
library(data.table)

n <- 10000000
test <- rnorm(n)
fac <- sample(1:5, n, replace = TRUE)

df <- data.frame(col1 = test, fac = fac)
dt <- data.table(col1 = test, fac = fac)

# Group means with a data frame via aggregate()
f_df <- function() {
  aggregate(col1 ~ fac, df, mean)
}
f_df()
system.time(f_df())

# Group means with a data table via by=
f_dt <- function() {
  dt[, mean(col1), by = fac]
}
f_dt()
system.time(f_dt())
The results are:
Data structure | User [s] | System [s] | Elapsed [s]
---|---|---|---
Data frame | 8.13 | 0.97 | 9.09
Data table | 0.12 | 0.01 | 0.14
Essentially, that is a speedup of roughly a factor of 65 in elapsed time on my machine (Windows 8.1, 8 GB RAM, Intel i5).
Where to go next: read this interesting blog post on data tables and performance. Also have a look at the fread function for reading in data, and at other nice features of data tables such as keys and assignment of columns by reference.
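As a quick, hedged sketch of those features (the file name "corpus.csv" is only a placeholder, and dt2 just mirrors the structure of the table from the snippet above):

library(data.table)

# fread() is a fast replacement for read.csv()/read.table();
# "corpus.csv" is a placeholder file name, not an actual file from this post
# dt2 <- fread("corpus.csv")

# Small example table with the same columns as the snippet above
dt2 <- data.table(col1 = rnorm(1e6), fac = sample(1:5, 1e6, replace = TRUE))

# Keys: sort once by fac, then subset via fast binary search
setkey(dt2, fac)
dt2[.(3)]                                   # all rows where fac == 3

# Assignment by reference: := adds/updates a column without copying the table
dt2[, centered := col1 - mean(col1), by = fac]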