Tuesday, December 6, 2011

R: Empirical Cumulative Distribution Function

There's a handy R built-in function for calculating the empirical cdf called ecdf.  I had 40 separate distributions (molecular weight for different series of molecules) and wanted to look at the cdf for this.  In R it's really straightforward.  Here's an example where we simulate some molecular weight data and plot the CDF for the whole set:

# create the data
R> data <- data.frame(mw=rnorm(10000, mean=400, sd=50), name=1:10000, group=sample(c('a','b','c','d'), 1000, replace=T))
R> mw_cdf <- data.frame(cdf=ecdf(data$mw)(data$mw)*100, mw=data$mw)

R> qplot(mw, cdf, data=mw_cdf, geom="step", xlab="Molecular weight", ylab="Total (%)", main="CDF of MW")


The ecdf() call returns a function (of class "ecdf") which you can call with a MW to calculate the probability of seeing a molecule of this, or smaller, MW in your set.


The R code above produces the following plot:


I found this useful for looking at the MW distributions for disparate groups of molecules, both together and separately.  If you're going to apply a molecular weight cutoff when analysing a set of molecules, it's good to get an idea of how many you'll be excluding and this provides a very quick way of seeing that information.

No comments:

Post a Comment