Friday, March 9, 2012

Timing your R code

Ever wanted to find out which of a set of methods is faster in R? Well, there's a very easy way to time your code: system.time.
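
As a minimal sketch of how the result can be read, the snippet below times a placeholder expression (Sys.sleep stands in for real work and is not part of my actual analysis) and pulls out the elapsed time:

```r
# Time an arbitrary expression; Sys.sleep() is only a stand-in for real work.
timing <- system.time(Sys.sleep(0.2))

# Printing shows the familiar user / system / elapsed columns.
print(timing)

# The result is a named numeric vector of class "proc_time",
# so individual entries can be extracted by name.
elapsed <- timing[["elapsed"]]
```

The "elapsed" entry is wall-clock time, while "user" and "system" are CPU time, so a sleeping process shows a large elapsed value but almost no user time.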

For example, I wanted to compare the speed of using subset's "select" option against restricting the full returned data.frame to a single column afterwards.

Here are examples showing the comparison I mean. Assume that "molecule_data" is a data.frame with at least one field (name) and that name_list is a vector of molecule names that I'm interested in.
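
If you want to try the comparison without my data, something like the following could stand in for it. This is only a sketch: the sizes, the extra columns (weight, charge) and the "mol_" naming scheme are all made up for illustration, and timings on it will not match mine.

```r
# Hypothetical stand-in for the real data: every value here is invented.
set.seed(1)
all_names <- paste0("mol_", seq_len(1000))

molecule_data <- data.frame(
  name   = sample(all_names, 50000, replace = TRUE),  # the one required column
  weight = runif(50000, 50, 900),                     # extra illustrative column
  charge = sample(-2:2, 50000, replace = TRUE),       # extra illustrative column
  stringsAsFactors = FALSE
)

# The molecule names we are interested in.
name_list <- sample(all_names, 25)
```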

Here is an example of using subset's "select" restriction mechanism:
mol_names <-
   unique( subset(molecule_data, name %in% name_list, select="name") )
Here is an example of restricting to a single column after subsetting (note that this returns a plain vector rather than a one-column data.frame):
mol_names <-
   unique( subset(molecule_data, name %in% name_list)$name )

I found out that, for my data, using subset's select option was ~50x faster.

system.time(
   mol_names <-
      unique(subset(molecule_data, name %in% name_list, name)))

 user  system elapsed
0.001  0.000  0.001

system.time(
   mol_names <-
      unique(subset(molecule_data, name %in% name_list)$name))

 user  system elapsed
0.055  0.000  0.056
These timings are too small to be reliable (especially the first one), so let's run each operation a hundred times to get a better estimate:
system.time(
   for(i in 1:100){
     mol_names <- unique(subset(molecule_data, name %in% name_list, name))
   }
)

 user  system elapsed
0.131  0.000  0.135

system.time(
   for(i in 1:100){
      mol_names <- unique(subset(molecule_data, name %in% name_list)$name)
   }
)

 user  system elapsed
5.607  0.161  5.802

You can see that the time difference holds up over multiple runs (about 43x here: 5.802 versus 0.135 seconds elapsed). Subset's "select" is the clear winner!
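
The repeat-and-time pattern above can be wrapped up so it is easy to reuse. The sketch below is one way to do that, not the code I actually ran: time_n, with_select and post_subset are names I made up, and the data is a small invented stand-in, so the numbers it prints will differ from those above.

```r
# Hypothetical stand-in data; substitute your own molecule_data and name_list.
set.seed(42)
molecule_data <- data.frame(
  name   = sample(paste0("mol_", 1:500), 20000, replace = TRUE),
  weight = runif(20000),
  stringsAsFactors = FALSE
)
name_list <- paste0("mol_", 1:20)

# Hypothetical helper: call f() n times and return the total elapsed seconds.
time_n <- function(f, n = 100) {
  system.time(for (i in seq_len(n)) f())[["elapsed"]]
}

# The two approaches being compared, wrapped as zero-argument functions.
with_select <- function() {
  unique(subset(molecule_data, name %in% name_list, select = name))
}
post_subset <- function() {
  unique(subset(molecule_data, name %in% name_list)$name)
}

# The first returns a one-column data.frame and the second a plain vector,
# so check that they agree on the names before comparing timings.
stopifnot(setequal(with_select()$name, post_subset()))

timings <- c(select = time_n(with_select), post = time_n(post_subset))
print(timings)
```

Checking that both versions return the same set of names first is worth the extra line: a speed comparison only means something if the two methods give equivalent answers.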