Thursday, May 5, 2011

R: invalid multibyte string

I've just been using R for some string comparison work (pairing the titles from two reference sets where the formatting was quite different). I bashed my head against the wall for a little while trying to figure out why the analysis was failing with an "invalid multibyte string 1" error.

OK, the error itself is fairly obvious: the supposedly plain-text strings contained some non-ASCII characters. Finding those characters was the hard part. I tried grep ("[^:print:]" and a few other patterns) and nedit but couldn't spot them. It turns out that simply opening the file in less solved my problem: it highlights the non-ASCII characters for all to see, so at that point they're easy to remove in a text editor.
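For anyone who would rather stay with grep, here is a rough sketch of a search that should list the offending lines. One likely reason the "[^:print:]" attempt found nothing useful is that a POSIX character class has to sit inside its own bracket expression, i.e. "[^[:print:]]"; written with single brackets it just excludes the literal characters ':', 'p', 'r', 'i', 'n' and 't'. The function name find_non_ascii and the file name titles.txt are made up for illustration, and the -a flag is a GNU/BSD grep extension rather than POSIX, so treat this as an assumption about your setup rather than a guaranteed recipe.

```shell
# Hypothetical helper: print (with line numbers) every line of a file that
# contains a byte outside printable ASCII and ordinary whitespace.
find_non_ascii() {
    # LC_ALL=C makes grep work byte by byte, so [:print:] covers ASCII only.
    # -a (GNU/BSD extension) forces the file to be treated as text.
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1"
}

# Example usage (titles.txt is a placeholder file name):
#   find_non_ascii titles.txt
```

The line numbers it prints should point you at the same characters that less highlights, which makes them quicker to track down in a long file.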

Oh, and if you're interested I was using stringdot from the 'kernlab' package to compare the strings.