Question

Split multi-fasta file and keep structure

3.7 years ago

genomes_and_MGEs ▴ 10

Hey everyone,

I have a multi-fasta file, and when I want to split it into individual fasta files (one per record), I use a command like this:

    awk '{
        if (substr($0, 1, 1) == ">") { filename = (substr($0, 2) ".fna") }
        print $0 > filename
    }' myfile

However, each individual fasta file represents a contig, and each contig belongs to a given bacterial genome. So, if I have a multi-fasta with headers like this:

    >PS_A_1
    >PS_A_2
    >PS_B_1
    >PS_B_2

Using the above command will generate 4 individual fasta files. My goal is to split the file so that PS_A_1 and PS_A_2 end up concatenated in the same file (PS_A.fasta), and likewise for PS_B and so on.

Thanks a lot!

updated 3.7 years ago by cpad0112 21k • written 3.7 years ago by genomes_and_MGEs ▴ 10

score 1 · Answer 1 · 2021-10-06

$ seqkit -w 0 split -i --id-regexp '(.*)_[0-9]+' test.fa
$ awk -F '[>_]' '/>/{getline seq; print $0"\n"seq > ($2"_"$3".fa")}' test.fa

Make sure the fasta is flattened (each sequence on a single line) if you use the awk command above, because it reads only the one line that follows each header. If sequences span multiple lines, use seqkit.
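If the input is not flattened yet, it can be unwrapped first. The following is a minimal sketch in plain awk, not part of the original answer; `sample.fa` and `flat.fa` are made-up file names, and the sample records are invented for illustration.

```shell
# Build a small multi-line example (hypothetical data).
cat > sample.fa <<'EOF'
>PS_A_1
ACGT
ACGT
>PS_A_2
GGCC
>PS_B_1
TTAA
>PS_B_2
CCGG
TT
EOF

# Join the wrapped sequence lines of each record into a single line.
awk '
    /^>/ {
        if (NR > 1) printf "\n"   # terminate the previous sequence line
        print                     # write the header unchanged
        next
    }
    { printf "%s", $0 }           # append sequence text without a newline
    END { if (NR > 0) printf "\n" }   # terminate the last sequence line
' sample.fa > flat.fa
```

The resulting `flat.fa` has exactly two lines per record, which is the layout the `getline`-based awk command above expects.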

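The commands above write each genome's contigs as separate records into one file per genome. If you instead want the contigs of each genome merged into a single record (one header, sequences joined end to end) in each file, try this: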

$ awk -F '[>_]' -v OFS="\t" '/>/{getline seq; print $2"_"$3,seq}' test.fa | datamash -s -g1 collapse 2 | awk '{gsub(/,/,""); print ">"$1"\n"$2 > ($1".fa")}'
$ seqkit replace -ip '_[0-9]+$'  -r "" test.fa  | seqkit fx2tab | datamash -sg 1 collapse 2 | sed 's/,//g' | seqkit tab2fx | seqkit split -i -O out

This requires GNU datamash, which is available in the package repositories of most Linux distributions.
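If datamash is not available, the same merge can be approximated with awk alone. This is a sketch under assumptions rather than a tested replacement: it assumes headers of the form `>GENOME_N` as in the question, it keeps every sequence in memory, and the `.merged.fa` suffix and `sample.fa` data are made up for illustration. Unlike the `getline` version, it also accepts multi-line sequences.

```shell
# Build a small multi-line example (hypothetical data).
cat > sample.fa <<'EOF'
>PS_A_1
ACGT
ACGT
>PS_A_2
GGCC
>PS_B_1
TTAA
>PS_B_2
CCGG
TT
EOF

awk '
    /^>/ {
        genome = substr($1, 2)               # first word of the header, without ">"
        sub(/_[0-9]+$/, "", genome)          # PS_A_1 -> PS_A
        if (!seen[genome]++) order[++n] = genome   # remember first-seen order
        next
    }
    genome != "" { seq[genome] = seq[genome] $0 }  # append sequence text to its genome
    END {
        for (i = 1; i <= n; i++) {
            out = order[i] ".merged.fa"
            print ">" order[i] > out         # one header per genome
            print seq[order[i]] > out        # all contigs joined end to end
            close(out)                       # avoid running out of open files
        }
    }
' sample.fa
```

Note that joining contigs without a separator discards the contig boundaries, exactly as the datamash pipelines do.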