Profiling and comparing code performance in R
In our meeting on 24 February 2021 we discussed profiling and comparing the performance of code that runs on data.table versus data.frame.
Code and timings for the examples are below.
One comparison to try is across the multiple versions of R installed on teton (compiled against different linear algebra and math libraries), possibly with different numbers of threads available. The thread count could affect data.table: on Linux and Windows data.table uses OpenMP, but on Macs it only runs single-threaded. If you figure out how to get multi-threading working for data.table, let us know. Presumably the number of threads needs to be specified in the slurm command on teton; by default you get only one thread in data.table in R on teton.
In preparation for this meeting, choose your own adventure, starting with the R code below, which we can profile in multiple ways (including system.time()). Additional methods for profiling come from Rprof, for which RStudio has built an interface. Come prepared to tell us what you did, what was interesting, what worked, etc. You are welcome to post additional code and comments here in advance of the meeting. If there is a lot of information shared by multiple contributors, we can move this to its own page.
Set up an interactive shell on teton with something like the following, where you replace the account with your own. Note that we’re using a bit of memory, more than the default, so we ask for 4GB.
srun --pty --account="evolgen" --nodes=1 --mem=4G -t 0-0:25 /bin/bash
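Since the thread count may matter for data.table, it can help to check what a session actually sees. The following is a minimal sketch, assuming data.table is installed. `getDTthreads()` and `setDTthreads()` are data.table functions; the value 2 is only an illustration and should not exceed the CPUs granted by slurm (for example via `--cpus-per-task`, which is not part of the srun command above).

```r
library(data.table)

# Report how many threads data.table will use in this session
n_before <- getDTthreads()
print(n_before)

# Request 2 threads (illustrative value); a build without OpenMP stays at 1
setDTthreads(2)
n_after <- getDTthreads()
print(n_after)
```

If `n_after` is still 1 after the request, the likely causes are a build without OpenMP support or a job that was allocated a single CPU.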
Assuming you have installed data.table, give this a whirl on your own computer and on teton with different versions of R (e.g., module load r/4.0.2-intel versus module load r/4.0.2-py27; the latter is compiled with gcc and a different linear algebra system, rather than the intel compiler with MKL).
library(data.table)
set.seed(10101)
x <- rnorm(10^7)
y <- x * 5 + 2 + rnorm(n = length(x), sd = 2)
myDF <- data.frame(x = x, y = y)
myDT <- data.table(x = x, y = y)
system.time(myDFlm <- lm(y ~ x, data = myDF))
system.time(myDTlm <- lm(y ~ x, data = myDT))
Config | myDFlm time (user; seconds) | myDTlm time (user; seconds) |
---|---|---|
teton: | 2.464 | 2.305 |
teton: | 1.794 | 1.692 |
iMac from 2017; standard R build for MacOS | 1.990 | 1.541 |
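Beyond `system.time()`, the same kind of `lm()` fit can be examined with `Rprof`, which samples the call stack while the code runs. This is a rough sketch only: the row count is reduced to 10^6 to keep it quick, and the names `prof_file`, `prof_df`, and `prof_summary` are illustrative rather than taken from the meeting code.

```r
set.seed(10101)
# Fewer rows than the 10^7 used above, so the sketch finishes quickly
x <- rnorm(10^6)
y <- x * 5 + 2 + rnorm(n = length(x), sd = 2)
prof_df <- data.frame(x = x, y = y)

prof_file <- tempfile(fileext = ".out")  # scratch file for profiler samples
Rprof(prof_file, interval = 0.01)        # start the sampling profiler
fit <- lm(y ~ x, data = prof_df)
Rprof(NULL)                              # stop profiling

# Summarise the time attributed to each function
prof_summary <- summaryRprof(prof_file)
head(prof_summary$by.total)
```

The `by.total` table lists functions by the time spent in them including their callees, which gives a first idea of where a fit spends its time.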
> system.time(foo <- subset(myDF, x < 10))
user system elapsed
3.035 0.846 4.260
> system.time(foo <- subset(myDT, x < 10))
user system elapsed
0.486 0.374 0.842
## compare the data.table way of subsetting
> system.time(foo <- myDT[x < 10])
user system elapsed
0.126 0.059 0.185
> system.time(foo2 <- myDT[x < 10])
user system elapsed
0.087 0.031 0.119
## here is an example with grouping and keys
> grp <- sample(LETTERS[1:10], 10^7, replace = TRUE)
> myDT[, grp := grp]
> system.time(myDT[, as.list(coef(lm(y~x))), by = grp])
user system elapsed
1.349 0.176 1.390
> setkey(myDT, grp)
> system.time(myDT[, as.list(coef(lm(y~x))), by = grp])
user system elapsed
1.206 0.156 1.349
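The grouped timings above change little after `setkey()`. A key also allows subsetting on the key column by value, using binary search. Below is a small sketch of that usage, assuming data.table is installed. The table `key_dt` and its reduced size are illustrative, and no timings are quoted because they depend on the machine.

```r
library(data.table)

set.seed(10101)
n <- 10^6  # smaller than the 10^7 rows above, for a quick run
key_dt <- data.table(x = rnorm(n),
                     grp = sample(LETTERS[1:10], n, replace = TRUE))

# Subset with a logical expression before any key is set
system.time(scan_a <- key_dt[grp == "A"])

# Sort by grp and mark it as the key, then subset by key value
setkey(key_dt, grp)
system.time(keyed_a <- key_dt[.("A")])

# Both approaches should select the same rows
nrow(scan_a) == nrow(keyed_a)
```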
library(bench)
library(data.table)
library(dtplyr)
library(tidyverse)
set.seed(10101)
x <- rnorm(10^4)
y <- x * 5 + 2 + rnorm(n = length(x), sd = 2)
my_df <- data.frame(x = x, y = y)
my_dt <- data.table(x = x, y = y)
mark(
  # Using dplyr with data.table (implicitly uses dtplyr)
  data.table_tidy = filter(my_dt, x < 0),
  # Using data.table syntax
  data.table_dt = my_dt[x < 0],
  # Using base subset
  data.frame_subset = subset(my_df, x < 0),
  # Using base with square brackets
  data.frame_square = my_df[x < 0, ],
  iterations = 1000,
  check = FALSE
) %>%
  select(-result:-gc)
# A tibble: 4 x 9
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
1 data.table_tidy 1.25ms 1.36ms 675. 5.12KB 8.19 988 12 1.46s
2 data.table_dt 317.1us 404.1us 2028. 192.15KB 2.03 999 1 492.61ms
3 data.frame_subset 393.5us 640.4us 1277. 493.93KB 1.28 999 1 782.53ms
4 data.frame_square 468.4us 490.5us 1768. 376.6KB 0 1000 0 565.74ms
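The benchmark above sets `check = FALSE`, so `mark()` does not verify that the expressions return equivalent results. As a sketch, the two data.frame approaches can be compared with `check = TRUE`, which makes `mark()` stop with an error if the results differ. The object name `bm` and the reduced iteration count are illustrative choices, and the bench package is assumed to be installed.

```r
library(bench)

set.seed(10101)
x <- rnorm(10^4)
y <- x * 5 + 2 + rnorm(n = length(x), sd = 2)
my_df <- data.frame(x = x, y = y)

# check = TRUE compares the results of the expressions before timing them
bm <- mark(
  data.frame_subset = subset(my_df, x < 0),
  data.frame_square = my_df[my_df$x < 0, ],
  iterations = 100,
  check = TRUE
)
print(bm$median)
```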
Speed differences using different pipes:
Goofing around a bit more with R’s native “pipe”. Note that this is just for fun; using this syntax is strongly discouraged because it is difficult to understand. That said, a base R pipe, akin to the {magrittr} pipe, is coming.