Friday, June 21, 2013

Better tab use with bash

I use tab-delimited files a lot. I hate comma-separated values (and if you've ever had to deal with fields that could contain commas, you'll agree). If you like to use a good set of Linux command-line tools, though, tabs sometimes cause problems, because a literal tab is awkward to type at the prompt.

One of the most viewed posts on this blog is the unix-join-with-tabs one, where I describe the "Ctrl-v <tab>" method of inserting a tab character on the command line for things like cut -d':' --output-delimiter=<tab>, where the output delimiter needs to be a tab.

However, there's a much nicer/easier way of specifying a tab character in bash: $'\t'. Take a look at the QUOTING section of the bash manpage for all the details.

Here's a simple example of it in action:
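A minimal sketch of the idea, assuming bash and GNU cut (for --output-delimiter). The sample records and the variable names `out` and `sorted` are made up for illustration:

```shell
# Requires bash: $'\t' is ANSI-C quoting and expands to a literal tab.

# Hypothetical colon-delimited input; keep fields 1 and 3, tab-separated.
out=$(printf 'alice:30:london\nbob:25:paris\n' |
      cut -d':' -f1,3 --output-delimiter=$'\t')
printf '%s\n' "$out"

# The same quoting works wherever a tool expects a tab argument,
# e.g. sort's field separator.
sorted=$(printf 'b\t2\na\t1\n' | sort -t$'\t' -k2,2n)
printf '%s\n' "$sorted"
```

No Ctrl-v gymnastics are needed, and the command survives copy and paste because the tab is spelled out rather than embedded invisibly.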

Tuesday, June 4, 2013

Using 'find' to list files with multiple suffixes

I'm going through another data cleanup session in a (very) old work directory tree. I found myself examining/compressing/removing files with a recurring set of suffixes. The 'find' command can make this all a lot less painful.

Here's the command for matching a single suffix:
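A sketch of what such a command could look like. The scratch directory, the file names, and the choice of .tsv as the suffix are assumptions for illustration:

```shell
# Hypothetical scratch tree so the example is self-contained.
workdir=$(mktemp -d)
touch "$workdir/a.tsv" "$workdir/b.csv" "$workdir/c.log"

# List every regular file whose name ends in .tsv
single=$(find "$workdir" -type f -name '*.tsv' -exec ls {} \;)
printf '%s\n' "$single"

rm -rf "$workdir"
```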

But editing this command-line to match the next suffix of interest becomes tedious very quickly. Thankfully, you can chain together file tests like so (note the grouping):
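A sketch of the chained form, again using made-up file names and the .tsv/.csv suffixes as stand-ins. The escaped parentheses are the grouping: they make the -exec action apply to a match from either -name test:

```shell
# Hypothetical scratch tree so the example is self-contained.
workdir=$(mktemp -d)
touch "$workdir/a.tsv" "$workdir/b.csv" "$workdir/c.log"

# \( ... \) groups the two -name tests; -o is the 'or' operator.
# The output is sorted only to make the listing order predictable.
multi=$(find "$workdir" -type f \( -name '*.tsv' -o -name '*.csv' \) -exec ls {} \; | sort)
printf '%s\n' "$multi"

rm -rf "$workdir"
```

Further suffixes can be added inside the parentheses with more "-o -name" tests.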

Then acting on these files is easy: just update the -exec action to what you want, e.g. "-exec bzip2 {} \;". To reduce the number of command invocations, you probably want to use xargs or "-exec bzip2 {} +" instead.
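To make the difference in invocation counts concrete, here is a small sketch that uses a harmless `sh -c 'echo run'` stand-in instead of bzip2, so nothing is actually compressed. The file names and variable names are hypothetical:

```shell
# Hypothetical scratch tree with three matching files.
workdir=$(mktemp -d)
touch "$workdir/a.tsv" "$workdir/b.tsv" "$workdir/c.tsv"

# '\;' runs the command once per file: one "run" line per match.
per_file=$(find "$workdir" -type f -name '*.tsv' -exec sh -c 'echo run' _ {} \; | wc -l)

# '+' appends as many file names as fit to a single invocation.
batched=$(find "$workdir" -type f -name '*.tsv' -exec sh -c 'echo run' _ {} + | wc -l)

echo "per-file invocations: $per_file, batched invocations: $batched"

rm -rf "$workdir"
```

Swapping the stand-in for bzip2 gives the batched form mentioned above.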

An interesting note here is that the following command, which omits the grouping, isn't executed as you might expect. The 'ls' command is only executed on the *.tsv files because of the way the expression is evaluated: the implicit '-and' between the second '-name' and '-exec' has higher precedence than the '-or' between the two '-name' tests.
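A sketch of the pitfall, assuming .csv and .tsv as the two suffixes and made-up file names. Without the parentheses, find parses the expression as "-name '*.csv'" OR ("-name '*.tsv'" AND "-exec ls"), so only the second test is paired with the action:

```shell
# Hypothetical scratch tree so the example is self-contained.
workdir=$(mktemp -d)
touch "$workdir/a.tsv" "$workdir/b.csv"

# No grouping: -exec binds only to the second -name test,
# so b.csv matches the first test but is never passed to ls.
ungrouped=$(find "$workdir" -name '*.csv' -o -name '*.tsv' -exec ls {} \;)
printf '%s\n' "$ungrouped"

rm -rf "$workdir"
```

The .csv file is also not printed by default, because find only adds its implicit -print when the expression contains no action such as -exec.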