Thursday, February 11, 2010

Use csplit to split SDF files (or contextually split any file)

Say you want to split an SDF into individual entries. You could write a Perl script or one-liner (which is what I've been doing for a long time), or you could just use csplit. Thanks to Pat and Jessen for pointing this one out.

For example, say you had an SDF, test_mols.sdf, with 8 molecules in it and you wanted individual mol files:

> csplit -kzsf "test_mols" -b %02d.mol test_mols.sdf '/^\$\$\$\$/+1' '{*}'


This would result in 8 files called test_mols00.mol through test_mols07.mol. Unfortunately, each of these would still end with the SDF delimiter line ($$$$), so technically they are still SDFs. That's pretty easy to clean up with something like:

> perl -ni -e 'print unless /\$\$\$\$/' *.mol
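
One way to sanity-check the split is to compare the number of delimiter lines in the original SDF with the number of files csplit wrote. This is only a minimal sketch, assuming a POSIX shell with grep; the function name count_sdf_records is made up for illustration and is not part of csplit or any toolkit:

```shell
# count_sdf_records FILE
# Hypothetical helper: prints the number of records in an SDF by counting
# its "$$$$" delimiter lines (one per molecule).
count_sdf_records() {
    grep -c '^\$\$\$\$' "$1"
}

# Usage with the example above:
#   count_sdf_records test_mols.sdf     # records in the original SDF
#   ls test_mols[0-9]*.mol | wc -l      # files written by csplit
```

If the SDF ends with a delimiter line, the two numbers should agree.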


See the csplit manpage for more details.
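
A question that comes up in the comments below is how to get several molecules per output file. csplit cuts at every match of the pattern, and the offset after the pattern (the +1) counts lines, not molecules, so changing it to +200 will not give 200 molecules per file. One possible approach is to count delimiters with awk instead. The following is a sketch under assumptions rather than a tested recipe: it assumes a POSIX awk, and the function name split_sdf_chunks, its argument order, and the four-digit .sdf naming are all made up for illustration:

```shell
# split_sdf_chunks FILE N PREFIX
# Hypothetical helper: writes PREFIX0001.sdf, PREFIX0002.sdf, ...
# with at most N molecules in each output file.
split_sdf_chunks() {
    awk -v n="$2" -v prefix="$3" '
        BEGIN { part = 1; count = 0; out = sprintf("%s%04d.sdf", prefix, part) }
        { print > out }                  # copy every line to the current chunk
        /^\$\$\$\$/ {                    # a delimiter line ends one molecule
            count++
            if (count == n) {            # chunk is full: move on to the next file
                close(out)
                part++
                count = 0
                out = sprintf("%s%04d.sdf", prefix, part)
            }
        }
    ' "$1"
}

# Usage: about 200 molecules per file
#   split_sdf_chunks test_mols.sdf 200 test_mols_
```

To split into a fixed number of files instead, divide the molecule count by the number of files and pass the result as N. For example, 200,000 molecules in 50 files works out to N = 4000. The delimiters are kept, so each output file is still a valid SDF.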

5 comments:

  1. I think Brandon deserves credit, too. Teamwork! :)

  2. Hi guys,

    Thanks a lot for your post. I had taken the perl route and was happy to find an alternative.

    For info, I had to double-escape the $. The following line worked for me (the suffix had to accommodate 4 digits):
    csplit -kzsf "Prefix" -b %04d.mol ./Original.sdf /^\\$\\$\\$\\$/+1 {*}

    Keep up the good work!

  3. Hi all,
    I'm new to shell scripting. If I wish to split a big sdf file into smaller sdf files with ~200 molecules per file, should I be writing:
    csplit -kzsf "test_mols" -b %0b.mol test_mols.sdf /\$\$\$\$/+200 {*}
    ?
    (or
    csplit -kzsf "Prefix" -b %04d.mol ./Original.sdf /^\\$\\$\\$\\$/+200 {*}
    ?)
    Thanks!

  4. This comment has been removed by the author.

  5. hi
    I have a dataset of 200,000 compounds in SDF format and want to split it into 50 subfiles. Can you suggest a command for this task? I am new to programming.
    REGARDS !
