Monday, October 8, 2012

Mercurial file patterns

I wanted to add all the scripts in a directory tree to the local Mercurial repository. I was going to do something like this:
> find . -name \*.sh | xargs hg add
That works well enough (find and xargs go well together), but you can also do it with Mercurial alone using file patterns, e.g.
> hg add 'glob:**.sh'
The two most useful patterns (for me) are:
  • '*' matches any text in the current directory only
  • '**' matches anything in the entire tree, as shown below
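For example (the filenames here are made up), the difference between the two in practice:
> hg add 'glob:*.sh'     # adds ./build.sh, but not ./scripts/run.sh
> hg add 'glob:**.sh'    # adds .sh files anywhere under the current directory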
The patterns are much richer than this though; see hg help patterns for more on what's available (including regexes).

Tuesday, June 5, 2012

Removing VMWare Player - blank grey dialogue box

I've just spent way too long trying to upgrade VMWare Player on my laptop. The issue was that, when I tried to uninstall the incumbent version, I was presented with an unhelpful blank grey dialogue box - no buttons, no text, nothing.

The box didn't go away, even after a couple of hours of waiting - I left it that long because I wondered whether something was being unpacked in the background. That kind of information is frustratingly difficult for me to get at on a Windows machine... Eventually, I had to use Task Manager to kill it. I went through a few iterations of this, trying out different ways of uninstalling or running the newer installer (even setting different default browsers, since the contents of the box turned out to be HTML and I thought it might be a compatibility issue - but computer says "no").


I looked around for solutions & came across the following, which worked for me:


http://superuser.com/questions/245424/vmware-workstation-install-problem


The most pertinent advice was:

  • To uninstall any old version, go to C:\Windows\Installer
  • Add the "Authors" column and sort by it
  • One of the .msi files will have a "VMware" author
  • Double-click it and follow through with the uninstall steps
After uninstalling the older VMWare Player using this method, I was then able to install the latest version and get playing with my brand spanking new ACE image.  Success!
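For what it's worth, the same lookup can probably be done from an elevated command prompt rather than by digging through C:\Windows\Installer - I haven't tried this route myself, and the GUID below is just a placeholder for whatever wmic reports on your machine:
C:\> wmic product where "Vendor like '%VMware%'" get Name, IdentifyingNumber
C:\> msiexec /x {PRODUCT-GUID-FROM-ABOVE}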

Friday, March 9, 2012

Timing your R code

Ever wanted to find out which of a set of methods is faster in R?   Well, there's a very easy way to time your code: system.time.

For example, I wanted to compare the speed of using subset's "select" option against restricting the returned data.frame to a single column afterwards.

Here are examples showing the comparison I mean. Assume that "molecule_data" is a data.frame with at least one field (name) and that name_list is a vector of molecule names that I'm interested in.

Here is an example of using subset's "select" restriction mechanism:
mol_names <-
   unique( subset(molecule_data, name %in% name_list, select="name") )
Here is an example of restricting to a single column after subsetting:
mol_names <-
   unique( subset(molecule_data, name %in% name_list)$name )

I found out that, for my data, using subset's select option was ~50x faster.

system.time(
   mol_names <-
      unique(subset(molecule_data, name %in% name_list, name)))

 user  system elapsed
0.001  0.000  0.001

system.time(
   mol_names <-
      unique(subset(molecule_data, name %in% name_list)$name))

 user  system elapsed
0.055  0.000  0.056
These timings are unreliable given how small they are (especially the first one), so let's run the operation a hundred times to get a better estimate:
system.time(
   for(i in 1:100){
     mol_names <- unique(subset(molecule_data, name %in% name_list, name))
   }
)

 user  system elapsed
0.131  0.000  0.135

system.time(
   for(i in 1:100){
      mol_names <- unique(subset(molecule_data, name %in% name_list)$name)
   }
)

 user  system elapsed
5.607  0.161  5.802

You can see that the time difference holds up over multiple runs.  Subset's "select" is the clear winner!
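As an aside, if you want more robust timings than a hand-rolled loop around system.time, the microbenchmark package will repeat and summarise the runs for you. A sketch (assuming the package is installed; this isn't what I ran above):
library(microbenchmark)
microbenchmark(
   select = unique(subset(molecule_data, name %in% name_list, select="name")),
   post   = unique(subset(molecule_data, name %in% name_list)$name),
   times  = 100)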

Tuesday, December 6, 2011

R: Empirical Cumulative Distribution Function

There's a handy built-in R function for calculating the empirical CDF called ecdf. I had 40 separate distributions (molecular weight for different series of molecules) and wanted to look at their CDFs. In R it's really straightforward. Here's an example where we simulate some molecular weight data and plot the CDF for the whole set:

# create the data
R> data <- data.frame(mw=rnorm(10000, mean=400, sd=50), name=1:10000, group=sample(c('a','b','c','d'), 10000, replace=T))
R> mw_cdf <- data.frame(cdf=ecdf(data$mw)(data$mw)*100, mw=data$mw)

# plot the CDF (qplot comes from ggplot2)
R> library(ggplot2)
R> qplot(mw, cdf, data=mw_cdf, geom="step", xlab="Molecular weight", ylab="Total (%)", main="CDF of MW")


The ecdf() call returns a function (of class "ecdf") which you can call with a MW value to get the probability of seeing a molecule of that MW or smaller in your set.


The R code above produces the following plot:


I found this useful for looking at the MW distributions for disparate groups of molecules, both together and separately.  If you're going to apply a molecular weight cutoff when analysing a set of molecules, it's good to get an idea of how many you'll be excluding and this provides a very quick way of seeing that information.
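Because ecdf() returns a function, you can also query it directly rather than plotting. A quick sketch using the simulated data above (the cutoff values are just illustrative):
R> mw_fn <- ecdf(data$mw)
R> mw_fn(400)   # ~0.5: half the simulated molecules are at or below 400
R> mw_fn(500)   # ~0.98: a 500 Da cutoff would exclude only ~2% of this set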

Thursday, December 1, 2011

Running chrome on multiple displays

By display here I mean $DISPLAY, not multiple monitors.

I often leave Chrome running at work (on a ton of virtual desktops, with a ton of tabs) and then get home and need to log into work (via VPN + NX) to check on things. It's frustrated me that if I try to run Chrome in the NX session, it appears on my screen at work rather than in the NX session. Well, who would have thought that reading the man page would solve the frustration? This method sets up a new profile, so you lose all your bookmarks and settings, but that's OK for occasional use, I think:
/opt/google/chrome/google-chrome --user-data-dir=<dir>
Where <dir> is a directory of your choosing. The default for Chrome is ~/.config/google-chrome/, so I've been using ~/.config/google-chrome/nx. Phew, friction resolved.
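I've since wrapped it in a small alias so I don't have to remember the flag (the alias name and profile directory are just my choices):
alias chrome-nx='/opt/google/chrome/google-chrome --user-data-dir="$HOME/.config/google-chrome/nx"'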

Tuesday, October 18, 2011

Using linux 'seq' with large numbers

Sometimes I deal with quite a lot of data (IMO). Occasionally this has to be split into smaller files in order to be processed. I use 'seq' quite a lot for generating program lists to work on these files.

I ran into an issue the other day: I wanted to include the line-number offset in the filenames of the files I was generating, but as soon as I hit a million lines the 'seq' numbers started to use scientific notation (i.e. rather than 1000000, seq output 1e+6) - and that wasn't compatible with some of the downstream processing.

The 'seq' manpage seemed to claim that it accepts printf formatting arguments, so I tried running 'seq -f "%d" 0 10000 160000000' and 'seq -f "%i" 0 10000 160000000', but neither of these was recognised. It turns out that seq actually only recognises printf-style floating-point formats... so to get it to work as desired you have to use "%.0f" instead:

> seq -f "%.0f" 1000000 1000000 10000000
1000000 
2000000 
3000000 
4000000 
5000000 
6000000 
7000000 
8000000 
9000000 
10000000 
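As a sketch of how the formatted offsets then get used (the file name and the million-line chunk size here are made up), each number can go straight into a generated filename:
> for offset in $(seq -f "%.0f" 0 1000000 9000000); do
>    tail -n +"$((offset + 1))" big_file.txt | head -n 1000000 > "chunk_${offset}.txt"
> done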

Tuesday, September 20, 2011

"Rule Engine" Knime node - matching missing values

I just spent fifteen minutes trying to work out how to match missing values in the Knime "Rule Engine" node. It turns out you put the "MISSING" keyword before the column definition.

So, the rule looks something like this:

MISSING $species$ => 'blank'