Monday, December 16, 2013

KNIME: duplicate column issue

Sometimes when you try to read in multiple SDFs with the "SDF Reader" node with "extract all properties" enabled (in order to extract all the associated tag data), the node can fail with:

"Execute failed: Duplicate column name "<colname>" at positions 1 and 2" - (the column names and reported positions depend on the format of the input data).

I tried to fix up the column names, thinking this was a whitespace issue or some such, but had no joy. For me the solution was to read the molecules without extracting the data and then use the "SDF Extractor" node to grab the data. This worked with no issues.

Tuesday, November 19, 2013

ShrewSoft VPN Manager window does not show under 64Bit Windows 7

I'm using version 2.2.2 of the ShrewSoft VPN client on my 64bit Win7 laptop. For some reason, the 'Manage' window doesn't show in this version - which bit me when my VPN details changed. The .pcf configuration files aren't re-read if you edit them, so you need access to the 'Manage' dialog to update your connection - you can't even add a connection without it.

So, the solution for me was actually quite simple -

  1. try to bring up the 'Manager' dialog (right click the ShrewSoft icon in the bottom right of the taskbar and click on 'Manage')
  2. bring up the task manager (Ctrl-Shift-Esc is a handy short-cut for that) - the VPN manager should be present in the "Applications" window
  3. right click this and select 'maximize'
It might not be pretty, but at least it's now accessible.

Thursday, October 24, 2013

Compare a local and a remote file using 'diff' and process substitution

I don't think I've posted this one before: I've been doing a lot of moving files around between local and remote servers and have made heavy use of this idiom. First find the files that need to be manually inspected (rsync -ilvrn <localdir> <remotedir>), then use a diff command like the one below to compare them.
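Here's a sketch of the idiom itself (the host and file names are placeholders) - process substitution lets diff read the remote file's contents over ssh as if it were a local file:

> diff local_file.txt <(ssh user@remotehost cat /path/to/remote_file.txt)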

Friday, June 21, 2013

Better tab use with bash

I use tab delimited files a lot. I hate comma separated values (and if you've ever had to deal with fields that could contain commas, you'll agree). If you like to use a good set of linux command line tools, then tabs sometimes cause problems.

One of the most viewed posts on this blog is the unix-join-with-tabs one, where I describe the "Ctrl-v <tab>" method of inserting a literal tab character on the command line for things like cut -d':' -f1- --output-delimiter=<tab>, where the output delimiter needs to be a tab.

However, there's a much nicer/easier way of specifying a tab character in bash: $'\t'. Take a look at the QUOTING section of the bash manpage for all the details.

Here's a simple example of it in action:
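Something along these lines (using the colon-delimited /etc/passwd purely for illustration):

> cut -d':' -f1,6 --output-delimiter=$'\t' /etc/passwd

The same trick works anywhere a literal tab is needed, e.g. join -t$'\t' file1.tsv file2.tsv.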

Tuesday, June 4, 2013

Using 'find' to list files with multiple suffixes

I'm going through another data cleanup session in a (very) old work directory tree. I found myself examining/compressing/removing files with a recurring set of suffixes. The 'find' command can make this all a lot less painful.

Here's the command for matching a single suffix:
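Something like this (the suffix is just an example):

> find . -name '*.csv'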

But editing this command-line to match the next suffix of interest becomes tedious very quickly. Thankfully, you can chain together file tests like so (note the grouping):
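A sketch with a few example suffixes - note the escaped parentheses that group the '-o' (or) tests:

> find . \( -name '*.csv' -o -name '*.tsv' -o -name '*.out' \)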

Then acting on these files is easy - just update the -exec action to what you want (e.g. "-exec bzip2 {} \;" - you probably want to use xargs or "-exec bzip2 {} +" for this to reduce the number of command invocations)

An interesting note here is that the following command isn't executed as you might expect. The 'ls' command is only executed on the *.tsv files due to the way the expression is evaluated: from left to right, with the implicit '-and' between the second '-name' and the '-exec' having higher precedence than the '-or' between the two '-name' tests.
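A reconstruction of the kind of command meant here (same example suffixes as above):

> find . -name '*.csv' -o -name '*.tsv' -exec ls {} \;

Because of the precedence, this is evaluated as -name '*.csv' -o ( -name '*.tsv' -and -exec ls {} \; ), so files matching *.csv never reach the -exec (and produce no output at all, since an explicit action suppresses the default -print).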

Wednesday, April 10, 2013

Finding (and fixing) files with undesirable permissions

I use a few programs that create files that are not group readable/writable; this is an issue when working in a group environment (especially with KNIME, where important files locked this way prevent other users from even opening your workspaces). A find command will locate these files and execute 'ls -lh' on each of them to show the current permissions (it would probably be more efficient to pipe into xargs to run ls on multiple files at once); a second find command will then update the permissions on those files - both are sketched below. Of course, you can just run chmod recursively (chmod -R g+rw *), but this isn't always what you want.
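A minimal sketch of both commands (the exact -perm test is my reconstruction; adjust the permission mask to taste):

> find . -type f ! -perm -g=rw -exec ls -lh {} \;
> find . -type f ! -perm -g=rw -exec chmod g+rw {} \;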

Friday, March 8, 2013

Merge images

I have a couple of sets of related density plots (each set with ~9 images) that I need to enter into a report. I'm feeling lazy today and don't want to go through the pain of uploading each of these to our internal wiki so I merged them instead.

The montage program (part of the ImageMagick suite of tools) is really simple and effective for this:
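Something along these lines (the filenames and exact spacing values are mine):

> montage -tile 2x -geometry +20+20 -shadow density_*.png merged.png

-tile 2x fixes the layout at two columns, -geometry adds the spacing between plots, and -shadow gives the drop shadow.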

Hey presto - a set of merged density plots laid out in two columns (regardless of the number of images), with some spacing between them and a bit of drop shadow to visually separate the plots.

Monday, February 18, 2013

OpenBabel: Convert SDF to SMILES and keep the data!

This seems like it should be the default option when converting from an SDF to a SMILES file - keep the damned data! Well, in openbabel (at least the version I'm running) this is not the default. If you want to keep the SD data you have to specify each tag as part of the '--append' argument. e.g.
> babel test1.sdf --append "cLogD7.4 cLogP model_score1 model_score2 some_other_property" test1.smi
In order to end up with a tab-delimited file (my favourite), you have to prefix the argument to '--append' with the desired delimiter character. I used "Ctrl-v <tab>" to get a tab in my string. It seems odd that tab isn't the default delimiter, since a tab is already used to separate the SMILES string from the molecule name in the standard conversion.
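Using the bash $'\t' quoting trick from the June post above, the tab-prefixed version should look something like this (untested sketch; property names as in the example above):

> babel test1.sdf --append $'\t'"cLogD7.4 cLogP model_score1 model_score2 some_other_property" test1.smi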

Friday, February 15, 2013

simple parallel processing with make

I've used xargs a fair bit for some simple, local, parallel processing. I was just recently reminded of another clever and simple solution that a friend came up with.

I'll let the code do the talking; here's the basic bash script (stored as an executable):
#!/bin/bash
# Read commands (one per line) on stdin, generate a makefile with one
# numbered target per command plus an 'all' target that depends on them
# all, then hand the whole thing to make to run the commands in parallel.

if [[ $# -ne 1 ]]; then
   echo "Usage: cat commands.txt | $(basename "$0") <num processes>"
   exit 1
fi

(while IFS= read -r line; do
   # emit one target per command: "1:", "2:", ... each running its line
   # (printf rather than echo -e, so backslashes in commands survive)
   printf '%d:\n\t%s\n' $((++i)) "$line"
done
# 'all' depends on every numbered target
echo "all:" $(seq 1 $i)) | make -B -j "$1" -f <(cat -) all
This uses a couple of clever tricks. I especially like the use of process substitution in the make command (substituting the 'cat -' for the input makefile).

This approach allows the commands in commands.txt to redirect their own output as they need to (using '>', '2>', '&>', etc.).
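For example, with the script saved as pmake.sh (the name is mine) and a commands.txt containing one shell command per line:

> cat commands.txt | ./pmake.sh 4

would run those commands four at a time.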

Monday, February 11, 2013

Bash: while [[ -e /proc/$process_id ]]

Sometimes you need to keep track of the amount of memory a process is taking up (or CPU, or something else). The /proc filesystem contains a subdirectory for each running process; the directory is named with the pid of the process. You can use this fact along with a basic file test and a while loop to track a process for as long as it lives.
> some_interesting_job.py &

> process_id=$(ps -o "%p %c" | grep "[s]ome_interesting_job" | awk '{print $1}');\
 while [[ -e /proc/$process_id ]];\
 do ps -o "%z" -p $process_id;\
 sleep 5;\
 done
This will report on the virtual memory size (in KiB - see the 'ps' manpage for more details) that the process is taking up; the "[s]" in the grep pattern stops grep from matching its own entry in the process list, and awk copes with the leading whitespace that ps pads the pid with. The while loop will terminate when the process completes (or is killed).

Friday, January 25, 2013

commandlinefu.com

I just rediscovered this excellent site. Well worth the occasional browse: http://www.commandlinefu.com/

There are some real gems in there! For example:

> python -m SimpleHTTPServer
Serve current directory tree at http://$HOSTNAME:8000/
Useful for tidying up your workspace whilst keeping jobs running:
> disown -a && exit
Close shell keeping all subprocess running
I do love a bit of process substitution:
> diff <(sort file1) <(sort file2)
diff two unsorted files without creating temporary files
Handy:
> rm !(*.foo|*.bar|*.baz)
Delete all files in a folder that don't match a certain file extension
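(Note that the !(...) negation pattern relies on bash's extended globbing, so you may need to enable it first with "shopt -s extglob".)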
And a "I should have thought of this; it's so obvious now!" trick:
> some_very_long_and_complex_command # label

Easy and fast access to often executed commands that are very long
and complex. When using reverse-i-search you have to type some 
part of the command that you want to retrieve. However, if the
command is very complex it might be difficult to recall the parts
that will uniquely identify this command. Using the above trick
it's possible to label your commands and access them easily by 
pressing ^R and typing the label (should be short and descriptive).

Thursday, January 24, 2013

R: formatting numbers for output

I often produce tables of numbers with a lot of significant digits after the decimal point. It's confusing to look at 20 sig [fd]igs, especially in a large table of results, so I tend to format my output to make it more concise and easier to read.

The function 'format' is pretty good for this. Note that the 'format' call returns a character vector (which is fine if you're only going to write the number to file or the console).

Here's a simple example just using a randomly generated number:

> num <- rnorm(1, mean=10)
> num
[1] 10.24339
We call format with digits=4 (show 4 significant digits) and nsmall=4 (display at least 4 digits after the decimal - for real/complex numbers in non-scientific format):
> format(num, digits=4, nsmall=4)
[1] "10.2434"
You can see that the format command rounds the numbers. This uses the IEC 60559 standard - 'go to the even digit'. So 0.5 is rounded to 0 and 1.5 is rounded to 2...
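You can check this behaviour directly with round():

> round(c(0.5, 1.5, 2.5))
[1] 0 2 2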

Of course, if you're used to sprintf style commands then you can also use the sprintf function for this:

> sprintf("%.4f", num)
[1] "10.2434"

Wednesday, January 2, 2013

Simple parallel processing with xargs

We've all been there - looking in a nicely tidied up directory, full of archived data - hundreds of lovely data files all gzipped or (better) bzip2'ed; but now you want to use them and you have to uncompress them all... "if only I could use all n CPUs on my local machine to do this!": enter 'xargs'!

It's as simple as:
> ls *.bz2 | xargs -n 1 -P 6 bunzip2
This will set off bunzip2 on all bz2 files in the current directory.

The '-n 1' flag tells xargs to only provide one argument (file) per command line; the '-P 6' tells xargs how many concurrent processes to run.