For example, say you had an SDF file, test_mols.sdf, with 8 molecules in it and you wanted individual mol files:
> csplit -kzsf "test_mols" -b %02d.mol test_mols.sdf /\$\$\$\$/+1 {*}
This would result in 8 files called test_mols00.mol through test_mols07.mol. Unfortunately these would still contain the SDF delimiter at the end of the file (so, technically these are still SDFs). That's pretty easy to clean up with something like:
> perl -ni -e 'print unless /\$\$\$\$/' *.mol
See the csplit manpage for more details.
I think Brandon deserves credit, too. Teamwork! :)
Hi guys,
Thanks a lot for your post. I had taken the Perl route and was happy to find an alternative.
For info, I had to double-escape the $. The following line worked for me (the suffix had to accommodate 4 digits):
csplit -kzsf "Prefix" -b %04d.mol ./Original.sdf /^\\$\\$\\$\\$/+1 {*}
Keep up the good work!
Hi all,
I'm new to shell scripting. If I wish to split a big SDF file into smaller SDF files with ~200 molecules per file, should I be writing:
csplit -kzsf "test_mols" -b %0d.mol test_mols.sdf /\$\$\$\$/+200 {*}
?
(or
csplit -kzsf "Prefix" -b %04d.mol ./Original.sdf /^\\$\\$\\$\\$/+200 {*}
?)
Thanks!
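On the question above: as the csplit manpage describes it, the +200 in /REGEXP/+200 is an offset in lines from the matching line, not a number of matches, so neither command would group 200 molecules per file. Below is a minimal sketch of one alternative that uses awk to count $$$$ delimiter lines instead. The function name split_sdf_by_count, the .sdf suffix, and the four-digit numbering are illustrative assumptions, not something from the original post.

```shell
# Hypothetical helper: split an SDF into chunks of N molecules each.
# Usage: split_sdf_by_count INPUT.sdf N PREFIX
split_sdf_by_count() {
    awk -v n="$2" -v prefix="$3" '
        BEGIN { chunk = 0; count = 0; out = sprintf("%s%04d.sdf", prefix, chunk) }
        # Copy every line, including the $$$$ delimiter, to the current chunk.
        { print > out }
        # A line starting with $$$$ ends one molecule record.
        /^\$\$\$\$/ {
            count++
            if (count == n) {
                # Chunk is full: close it and point at the next file name.
                close(out)
                chunk++
                count = 0
                out = sprintf("%s%04d.sdf", prefix, chunk)
            }
        }
    ' "$1"
}
```

For example, split_sdf_by_count test_mols.sdf 200 test_mols_ would write test_mols_0000.sdf, test_mols_0001.sdf, and so on, with the last file holding whatever is left over. Each output keeps its $$$$ delimiters, so the pieces remain valid multi-molecule SDFs.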
Hi,
I have a dataset of 200,000 compounds in SDF format and want to split it into 50 subfiles. Can you suggest a command for this task? I am new to programming.
Regards!
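One possible approach to this, offered as a sketch and not a tested recipe for any particular dataset: count the $$$$ delimiter lines to get the number of molecules, divide by the number of files wanted (rounding up), and let awk start a new file each time that many records have been written. The function name split_sdf_into_parts and the two-digit .sdf naming are assumptions made for illustration.

```shell
# Hypothetical helper: split an SDF into at most K files of near-equal size.
# Usage: split_sdf_into_parts INPUT.sdf K PREFIX
split_sdf_into_parts() {
    # Count molecule records by counting delimiter lines.
    total=$(grep -c '^\$\$\$\$' "$1")
    # Ceiling division: molecules per output file.
    per_file=$(( (total + $2 - 1) / $2 ))
    awk -v n="$per_file" -v prefix="$3" '
        BEGIN { out = sprintf("%s%02d.sdf", prefix, 0) }
        # Copy every line to the current output file.
        { print > out }
        # After every n-th delimiter, switch to the next file name.
        /^\$\$\$\$/ && ++count % n == 0 {
            close(out)
            out = sprintf("%s%02d.sdf", prefix, ++chunk)
        }
    ' "$1"
}
```

With 200,000 molecules and K set to 50, this works out to 4,000 molecules per file. Because of the rounding up, a total that does not divide evenly can yield fewer than K files, with the last one smaller than the rest.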