Splitting multifasta by string match
2.5 years ago
Winston ▴ 10

I know this question gets asked fairly often, but I haven't found an answer to my specific issue. I have a multifasta whose headers all differ, but each contains a name= field identifying the gene it belongs to. I want to use that field to append each sequence to its gene's own multifasta file.

My original file looks something like this:

seqs.fasta
>cds-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGA
>rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGAGGAG
>exon-123-1 transcript=rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGG
>cds-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAG
>rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAGGACC
>exon-456-1 transcript=rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGA

and I want to split this into multiple files based on the name= field while retaining the original header in the new file:

ABC1.fasta
>cds-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGA
>rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGGAGGAG
>exon-123-1 transcript=rna-123 gene=gene-ABC1 name=ABC1 seq_id=123
GATCGG

DEF1.fasta
>cds-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAG
>rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGACAGGACC
>exon-456-1 transcript=rna-456 gene=gene-DEF1 name=DEF1 seq_id=456
GACCGA

I'm open to any and all solutions.

Thank you for your help!

fasta bash perl • 684 views
2.5 years ago

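A seqkit answer. The -i flag splits by sequence ID, and --id-regexp tells seqkit to take the ID from whatever follows name= in the header: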

seqkit split -i --id-regexp "name=(\S+)" seqs.fasta
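If seqkit isn't available, roughly the same split can be sketched with plain awk. This is only a sketch under a few assumptions: the function name split_by_name and the unmatched.fasta fallback file are made up for illustration, the name= field is assumed to be preceded by whitespace, and its value is assumed to be safe to use as a file name (no slashes).

```shell
# Split a multi-FASTA into one file per name= value, keeping headers intact.
# split_by_name is a hypothetical helper name, not part of any existing tool.
# Usage: split_by_name <input.fasta> [output_dir]
split_by_name() {
    mkdir -p "${2:-.}"
    awk -v outdir="${2:-.}" '
        /^>/ {
            # Close the previous file so many distinct names do not exhaust file handles.
            if (out != "") close(out)
            # Take the value of the whitespace-delimited name= field from the header.
            if (match($0, /[ \t]name=[^ \t]+/)) {
                out = outdir "/" substr($0, RSTART + 6, RLENGTH - 6) ".fasta"
            } else {
                # Assumption: records without a name= field go to a catch-all file.
                out = outdir "/unmatched.fasta"
            }
        }
        # Append every line (header or sequence) to the current output file.
        out != "" { print >> out }
    ' "$1"
}
```

Called as `split_by_name seqs.fasta out`, this should leave out/ABC1.fasta and out/DEF1.fasta for the example above. Because it appends rather than overwrites, clear the output directory before re-running.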

Worked perfectly. Many thanks!
