Friday, June 21, 2013

Better tab use with bash

I use tab-delimited files a lot. I hate comma-separated values (and if you've ever had to deal with fields that could contain commas, you'll agree). But if you rely on the usual set of Linux command-line tools, tabs sometimes cause problems, because a literal tab is awkward to type as an argument.

One of the most viewed posts on this blog is the unix-join-with-tabs one, where I describe the "Ctrl-v <tab>" method of inserting a tab character on the command line for things like cut -d':' --output-delimiter=<tab>, where the output delimiter needs to be a tab.

However, there's a much nicer and easier way of specifying a tab character in bash: $'\t'. Take a look at the QUOTING section of the bash manpage for all the details.

Here's a simple example of it in action:

echo -e "second:first" |\
cut -d: --output-delimiter=$'\t' -f 1,2 |\
awk -F $'\t' 'BEGIN{OFS=FS}{print $2,$1}'
# outputs: "first<tab>second"
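The same quoting works anywhere a tool expects a single-character delimiter. Here's a minimal sketch with sort and join; the file names and sample rows are made up for illustration, and it assumes bash (plain sh may not understand $'\t').

```shell
# Build two small tab-delimited tables in a temporary directory (illustrative data).
tmpdir=$(mktemp -d)
printf 'b\t2\na\t1\n' > "$tmpdir/left.tsv"
printf 'a\tapple\nb\tbanana\n' > "$tmpdir/right.tsv"

# sort -t takes a single-character field separator; $'\t' expands to a literal tab.
sort -t $'\t' -k1,1 "$tmpdir/left.tsv" > "$tmpdir/left.sorted.tsv"

# join -t works the same way; both inputs must be sorted on the join field.
joined=$(join -t $'\t' "$tmpdir/left.sorted.tsv" "$tmpdir/right.tsv")
printf '%s\n' "$joined"

# Clean up the temporary files.
rm -r "$tmpdir"
```

No Ctrl-v needed, and the command stays readable when pasted into a script.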

Tuesday, June 4, 2013

Using 'find' to list files with multiple suffixes

I'm going through another data cleanup session in a (very) old work directory tree. I found myself examining/compressing/removing files with a recurring set of suffixes. The 'find' command can make this all a lot less painful.

Here's the command for matching a single suffix:
find . -name "*.csv" -exec ls -ls {} \;

But editing this command-line to match the next suffix of interest becomes tedious very quickly. Thankfully, you can chain together file tests like so (note the grouping):
find . \( -name "*.csv" -or -name "*.tsv" \) -exec ls -lh {} \;

Acting on these files is then easy: just change the -exec action to whatever you need, e.g. "-exec bzip2 {} \;". For something like this you probably want xargs or "-exec bzip2 {} +", which pass many files to each invocation and so reduce the number of commands run.
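To see the difference between the two -exec terminators, here's a rough sketch that counts invocations using echo as a stand-in for the real command. The temporary directory and file names are invented for the demo.

```shell
# Create three throwaway files to match against (illustrative names).
tmpdir=$(mktemp -d)
touch "$tmpdir/a.csv" "$tmpdir/b.csv" "$tmpdir/c.tsv"

# "\;" runs the command once per matched file, so echo prints one line per file.
per_file=$(find "$tmpdir" \( -name "*.csv" -or -name "*.tsv" \) -exec echo {} \; | wc -l | tr -d ' ')

# "+" appends as many file names as fit onto a single command line,
# so echo runs once here and prints a single line.
batched=$(find "$tmpdir" \( -name "*.csv" -or -name "*.tsv" \) -exec echo {} + | wc -l | tr -d ' ')

echo "invocations with \\;: $per_file, with +: $batched"

# Clean up the temporary files.
rm -r "$tmpdir"
```

With thousands of files the "+" form saves a lot of process start-up overhead.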

An interesting note here is that the following command isn't executed as you might expect. The 'ls' command is only run on the *.tsv files because of the way the expression is evaluated: from left to right, with the implicit '-and' between the second '-name' and '-exec' having higher precedence than the '-or' between the two '-name' tests.
find . -name "*.csv" -or -name "*.tsv" -exec ls -lh {} \;
# this is actually executed as:
find . -name "*.csv" -or \( -name "*.tsv" -exec ls -lh {} \; \)
# when what you wanted was:
find . \( -name "*.csv" -or -name "*.tsv" \) -exec ls -lh {} \;
# the general recommendation is to use xargs in place of the -exec action
# (with no action given, the implicit -print applies to the whole expression)
find . -name "*.csv" -or -name "*.tsv" | xargs ls -lh
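The precedence issue is easy to check for yourself, and it comes back as soon as you add an explicit action such as -print0 to make the xargs pipeline safe for file names containing spaces. A small sketch, using a made-up temporary directory and file names:

```shell
# Two throwaway files, one with a space in its name (illustrative names).
tmpdir=$(mktemp -d)
touch "$tmpdir/one.csv" "$tmpdir/two words.tsv"

# Without grouping, the explicit -print binds only to the second -name test,
# so only the .tsv file is listed.
ungrouped=$(find "$tmpdir" -name "*.csv" -or -name "*.tsv" -print | wc -l | tr -d ' ')

# With grouping, -print applies to both tests.
grouped=$(find "$tmpdir" \( -name "*.csv" -or -name "*.tsv" \) -print | wc -l | tr -d ' ')

# -print0 with xargs -0 passes names null-delimited, so the space in
# "two words.tsv" does not split it into two arguments.
safe=$(find "$tmpdir" \( -name "*.csv" -or -name "*.tsv" \) -print0 | xargs -0 ls -1 | wc -l | tr -d ' ')

echo "ungrouped: $ungrouped, grouped: $grouped, via xargs -0: $safe"

# Clean up the temporary files.
rm -r "$tmpdir"
```

So the grouping parentheses are still worth keeping whenever an action follows the tests, even when that action only feeds xargs.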