I would like to split this into several smaller multi-FASTA files, one per organism, each containing all the segments from that organism and named after it. Preferably, the sequences in each file would also be sorted by length in descending order.
@microfuge I find your suggestion very useful, but the files are saved without a file extension. I'm new to awk and don't see where to place the .fa in the script.
One idea would be to recover the headers (grep "^>" yourfile > headers), then split the headers into organism-specific sub-files. You can then use the faSomeRecords utility from Jim Kent (UCSC) to extract the matching records. Add execute permissions after downloading the utility (chmod a+x faSomeRecords).
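The header-based steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy input, the virusX/virusY names, and the .list file names are made up, the organism is assumed to be the text before the first underscore in each header, and faSomeRecords is assumed to match on the first word of each header. The extraction step is skipped if faSomeRecords is not on your PATH.

```shell
# Toy input (hypothetical); assumes headers look like ">organism_segment".
cat > demo_headers.fa <<'EOF'
>virusX_seg1
ACGTACGT
>virusY_seg1
GGCC
>virusX_seg2
TTTTTTTTTTTT
EOF

# 1. Recover the headers.
grep "^>" demo_headers.fa > headers

# 2. Write one list of record names per organism (leading ">" removed).
awk '{
    name = substr($1, 2)              # first word of the header, without ">"
    split(name, parts, "_")           # organism = text before the first "_"
    print name > (parts[1] ".list")
}' headers

# 3. Extract the records for each organism, if faSomeRecords is installed.
#    Usage assumed here: faSomeRecords in.fa listFile out.fa
if command -v faSomeRecords >/dev/null 2>&1; then
    for list in virusX.list virusY.list; do
        faSomeRecords demo_headers.fa "$list" "${list%.list}.fa"
    done
fi
```

With real data you would loop over all the generated .list files instead of naming them explicitly.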
This awk script could work, but it does not sort by length. It names each output file after the text before the first underscore in the header.
awk '{if(substr($0,1,1) == ">"){split(substr($0,2,length($0)),a,/_/);filename=a[1]};print $0 > filename }' your_input_file.fa
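To get a .fa extension on the output files, the extension can be appended where the file name is built, i.e. filename=a[1]".fa" in the script above. A slightly longer variant is sketched below. It is not the original author's script: the toy input and its orgA/orgB names are made up, and the ">organism_segment" header layout is an assumption inferred from the split on "_". It also closes each file before switching to the next one, which may help if there are many organisms, and therefore appends with >> instead of >.

```shell
# Toy input (hypothetical); assumes headers look like ">organism_segment".
cat > demo_input.fa <<'EOF'
>orgA_seg1
ACGTACGT
>orgB_seg1
GGCC
>orgA_seg2
TTTTTTTTTTTT
EOF

# The awk script appends, so remove output left over from earlier runs.
rm -f orgA.fa orgB.fa

awk '
/^>/ {
    if (filename != "") close(filename)    # keep only one output file open
    split(substr($0, 2), parts, "_")       # organism = text before first "_"
    filename = parts[1] ".fa"              # this is where the extension goes
}
filename != "" { print >> filename }       # append header and sequence lines
' demo_input.fa
```

After running it, orgA.fa holds both orgA records and orgB.fa holds the single orgB record. Like the one-liner, this does not sort by length.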
USEARCH has a couple of options to sort FASTA sequences by length.
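If USEARCH is not available, sorting by length can also be approximated with standard tools. The sketch below is an alternative to USEARCH, not a description of it: each record is put on one tab-separated line (length, header, sequence), the lines are sorted numerically in descending order, and the records are written back out. The file names and toy records are made up, and it assumes headers contain no tab characters. Sequences come out unwrapped, on a single line each.

```shell
# Toy per-organism file (hypothetical) with one multi-line record.
cat > demo_org.fa <<'EOF'
>orgA_seg1
ACGT
ACGT
>orgA_seg2
TTTTTTTTTTTT
>orgA_seg3
GG
EOF

awk '
/^>/ {                                      # new record: flush the previous one
    if (header != "") print length(seq) "\t" header "\t" seq
    header = $0; seq = ""; next
}
{ seq = seq $0 }                            # join wrapped sequence lines
END { if (header != "") print length(seq) "\t" header "\t" seq }
' demo_org.fa |
  sort -t "$(printf '\t')" -k1,1nr |        # longest sequence first
  awk -F '\t' '{ print $2; print $3 }' > demo_org.sorted.fa
```

Running this on each per-organism file would give length-sorted output, with the longest segment first.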