Tuesday, December 6, 2011

R: Empirical Cumulative Distribution Function

There's a handy R built-in function for calculating the empirical cdf called ecdf.  I had 40 separate distributions (molecular weight for different series of molecules) and wanted to look at the cdf for this.  In R it's really straightforward.  Here's an example where we simulate some molecular weight data and plot the CDF for the whole set:

# create the data
R> data <- data.frame(mw=rnorm(10000, mean=400, sd=50), name=1:10000, group=sample(c('a','b','c','d'), 1000, replace=T))
R> mw_cdf <- data.frame(cdf=ecdf(data$mw)(data$mw)*100, mw=data$mw)

R> qplot(mw, cdf, data=mw_cdf, geom="step", xlab="Molecular weight", ylab="Total (%)", main="CDF of MW")

The ecdf() call returns a function (of class "ecdf") which you can call with a MW to calculate the probability of seeing a molecule of this, or smaller, MW in your set.

The R code above produces the following plot:

I found this useful for looking at the MW distributions for disparate groups of molecules, both together and separately.  If you're going to apply a molecular weight cutoff when analysing a set of molecules, it's good to get an idea of how many you'll be excluding and this provides a very quick way of seeing that information.

Thursday, December 1, 2011

Running chrome on multiple displays

By display here I mean $DISPLAY not multiple monitors.

I often leave Chrome running at work (on a ton of virtual desktops with a ton of tabs) and then get home and need to log into work (via VPN + NX) in order to check on things.  It's frustrated me that if I try to run chrome in the NX session then it appears on my screen at work rather than in the NX session.  Well, who would have thought that reading the man page would solve the frustration?  This method sets up a new profile, so you loose all  your bookmarks and settings, but that's ok for occasional use I think
/opt/google/chrome/google-chrome --user-data-dir=<dir>
Where <dir> is a dir of your choosing. The default for chrome is ~/.config/google-chrome/. So I've been using ~/.config/google-chrome/nx. Phew, friction resolved.

Tuesday, October 18, 2011

Using linux 'seq' with large numbers

Sometimes I deal with quite a lot of data (IMO). Occasionally this has to be split into smaller files in order to be processed. I use 'seq' quite a lot for generating program lists to work on these files.

I ran into an issue the other day; I wanted to include the line-number offset in the filename of the files I was generating, unfortunately as soon as I hit a million lines the 'seq' numbers started to use scientific notation (i.e. rather than 1000000 seq output 1e+6) - unfortunately, this wasn't compatible with some of the downstream processing.

The 'seq' manpage seemed to claims that it accepts printf formatting arguments. So I tried running 'seq -f "%d" 0 10000 160000000' and 'seq -f "%i" 0 10000 160000000' but neither of these were recognised. It turns out that seq actually only recognises the printf style floating-point format... so to get it to work as desired you have to use "%.0f" instead:

> seq -f "%.0f" 1000000 1000000 10000000

Tuesday, September 20, 2011

"Rule Engine" Knime node - matching missing values

I just spent fifteen minutes trying to work out how to match missing values in the Knime "Rule Engine" node. It turns out you put the "MISSING" keyword before the column definition.

So, the rule looks something like this:

MISSING $species$ => 'blank'

Thursday, May 5, 2011

R: invalid multibyte string

I've just been using R for some string comparison work (pairing the titles from two reference sets where the formatting was quite different). I bashed my head against the wall for a little while trying to figure out why the analysis was failing with a "invalid mulibyte string 1" error.

Ok, the error itself is fairly obvious - the supposedly plain text strings contained some non-ASCII characters. Finding those characters was the issue. I tried grep ("[^:print:]" and a few others) and nedit but couldn't see the issue. Turns out just using less solved my problems. The non-ASCII characters are highlighted for all to see (so they're easy to remove using a text editor at that point).

Oh, and if you're interested I was using stringdot from the 'kernlab' package to compare the strings.

Wednesday, April 6, 2011

View a summary of filetypes in a dir

Sometimes I fill up directories with lots of files of lots of types (different extensions). It's informative to be able to see what the distribution of types is in a dir. I tend to use this view during project cleanups to identify files that can be deleted or moved/consolidated. Anyway, here's a quick way of doing it:

for file in $(ls *.*); do base=$(basename $file); echo ${base##*\.};done | sort | uniq -c | sort -n -r
  37 tsv
  18 pdf
  17 txt
   7 err
   5 log
   5 png

Here I see that there are some log files that can be removed (if no longer needed).

Sunday, February 13, 2011

Reading stdin from a pipe for command-line R shenanigans

I think I rely on command-line tomfoolery a little bit too much. Today I wanted to pipe some data into R and have the commands to run defined on the command line as well. This is the kind of thing I do with Perl or bash to get some quick answers and I'd love to add R to the repertoire.

So, I tried a number of things all of which failed. This is how I got it to work for me:
> perl -le 'printf "%.4f\t%.4f\n", rand(), rand() for 1 .. 20' \
 | R --vanilla --slave -e\
 "data=read.delim(pipe('cat /dev/stdin'), header=F);\
  cor.test(data\$V1, data\$V2)"

You have to remember to escape any special characters in the R script ($ in this case).

Wednesday, January 12, 2011

R: using lattice graphs in functions

I like the lattice graphics package; it produces nice looking and easy to generate graphs. However, I was bitten the other day when a function I'd written to generate a load of density plots wasn't creating the output files I was expecting. They just weren't there at all. After a little bit of unsuccessful poking around, I gave up and generated them manually (it was a stressful day with tight deadlines).

I went back to this issue today and it turns out that the lattice functions 'do not create plots'. You have to print the object created from the lattice function in order to see/output the graph.

This happens automatically on the command line, which is why I was so confused since the code would work fine outside of a function. If you want your graph to be outputted within a function you have to explicitly print it.


function write_density_plot(data, name)
    filename = paste(name, '_density_distribution.png', sep="")
    print(densityplot(data, main=paste(name, 'score distribution'))

filedata = read.delim('myfile', header=T)
write_density_plot(filedata$blorg, 'blorg')
write_density_plot(filedata$blarg, 'blarg')
write_density_plot(filedata$wib, 'wib')