Split multi-fasta file and keep structure
3.1 years ago

Hey everyone,

I have a multi-fasta file, and when I want to split it into individual fasta files, I use a command like this:

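    awk '{
        # A header line starts with ">": name the output file after it
        if (substr($0, 1, 1) == ">") { filename = (substr($0, 2) ".fna") }
        # Write the current line (header or sequence) to that file
        print $0 > filename
    }' myfile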

However, each individual fasta file represents a contig, and each contig belongs to a given bacterial genome. So, if I have a multi-fasta like this:

>PS_A_1
>PS_A_2
>PS_B_1
>PS_B_2

Using the above command will generate 4 individual fasta files. My objective is to split the file so that PS_A_1 and PS_A_2 end up together in the same file (PS_A.fasta), and likewise for PS_B and so on.

Thanks a lot!

3.1 years ago
$ seqkit -w 0 split -i --id-regexp '(.*)_[0-9]+' test.fa
$ awk -F '[>_]' '/^>/{getline seq; print $0 "\n" seq > ($2 "_" $3 ".fa")}' test.fa

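Please make sure that the fasta is flattened (each sequence on a single line) if you are using the awk command above, and note that it assumes headers of the form prefix_genome_number, as in your example. If sequences span multiple lines, use seqkit.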
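If seqkit is not an option and the sequences are multi-line, a plain awk pass can also do the grouping. The following is only a minimal sketch, not a tested replacement for seqkit: the `demo_split` directory and its example records are made up for illustration, and it assumes every contig ID ends in `_<number>` and that the file starts with a header line.

```shell
# Build a small multi-line example (hypothetical data) in a scratch directory.
mkdir -p demo_split
cat > demo_split/test.fa <<'EOF'
>PS_A_1
ACGT
ACGT
>PS_A_2
GGCC
>PS_B_1
TTAA
>PS_B_2
CCGG
AATT
EOF

# On each header line, work out the genome name by dropping ">" and the
# trailing "_<number>", then send every line to the file for that genome.
(
  cd demo_split &&
  awk '/^>/ {
         name = substr($0, 2)        # header without ">"
         sub(/[ \t].*$/, "", name)   # keep only the ID (drop any description)
         sub(/_[0-9]+$/, "", name)   # PS_A_1 -> PS_A
         out = name ".fasta"
       }
       { print > out }' test.fa
)
```

This keeps each contig as its own record inside `PS_A.fasta`, `PS_B.fasta`, and so on. All output files stay open until awk exits, so with a very large number of genomes some awk implementations may run into an open-file limit.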

If you want the contigs of each genome concatenated into a single sequence record per file, instead of separate records grouped in the same file, try this:

$ awk -F '[>_]' -v OFS="\t" '/^>/{getline seq; print $2"_"$3,seq}' test.fa | datamash -s -g1 collapse 2 | awk '{gsub(/,/,""); print ">"$1"\n"$2 > ($1".fa")}'
$ seqkit replace -ip '_[0-9]+$' -r "" test.fa | seqkit fx2tab | datamash -sg 1 collapse 2 | sed 's/,//g' | seqkit tab2fx | seqkit split -i -O out

This requires GNU datamash, which is available in the repositories of most GNU/Linux distributions.
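If datamash cannot be installed, the same concatenation can be approximated with awk alone. This is a rough sketch under the same assumptions as above (IDs ending in `_<number>`, file starting with a header). The `demo_concat` directory and its records are invented for the example, and each genome's full sequence is held in memory until the end of the input.

```shell
# Hypothetical example input with one multi-line contig.
mkdir -p demo_concat
cat > demo_concat/test.fa <<'EOF'
>PS_A_1
ACGT
ACGT
>PS_A_2
GGCC
>PS_B_1
TTAA
EOF

(
  cd demo_concat &&
  awk '/^>/ {
         name = substr($0, 2)        # header without ">"
         sub(/[ \t].*$/, "", name)   # keep only the ID
         sub(/_[0-9]+$/, "", name)   # PS_A_1 -> PS_A
         next
       }
       { seq[name] = seq[name] $0 }  # append sequence lines in file order
       END {
         for (name in seq) {
           out = name ".fa"
           print ">" name "\n" seq[name] > out
           close(out)                # one file at a time, no descriptor pile-up
         }
       }' test.fa
)
```

Contigs are joined in the order they appear in the input, and each output file holds one record named after the genome (for example `>PS_A` in `PS_A.fa`).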
