Question

Strange fasta file

0

2.8 years ago

Vasiliy Krestov ▴ 30

I got this FASTA file when I tried to retrieve protein sequences from a local nr database with the blastdbcmd tool from the BLAST+ suite, using a list of accessions as the query.

Command: blastdbcmd -db nr -dbtype prot -entry_batch list_of_accessions -outfmt %f -out fasta.fa

Will working with this file cause problems? Have you ever come across such a format?

this

fasta blastdbcmd • 851 views

updated 2.8 years ago by cpad0112 21k • written 2.8 years ago by Vasiliy Krestov ▴ 30

0

Maybe you need a script that parses such a FASTA file, splits each merged header line into its individual headers, and copies the sequence under every one of them.

Example fasta:

>a.fasta >b.fasta
LYPASG

Resultant fasta:

>a.fasta
LYPASG
>b.fasta
LYPASG
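A minimal sketch of such a script, in plain Python. It assumes the merged headers are separated either by " >" (as in the example above) or by the Ctrl-A character, which some BLAST versions use; the function name `split_merged_headers` is made up for illustration. A description that legitimately contains " >" would be split incorrectly, so check the output on your own data.

```python
import re

# Assumption: merged deflines are separated by " >" or by Ctrl-A (\x01).
_SEPARATOR = re.compile(r"\x01| >")


def split_merged_headers(fasta_text):
    """Return FASTA text where every merged header gets its own record."""
    out = []
    headers = []
    seq_lines = []

    def flush(current_headers, current_seq):
        # Write one record per header, each with a copy of the sequence.
        for header in current_headers:
            out.append(">" + header)
            out.extend(current_seq)

    for line in fasta_text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            flush(headers, seq_lines)  # finish the previous record
            parts = _SEPARATOR.split(line[1:])
            headers = [p.strip() for p in parts if p.strip()]
            seq_lines = []
        else:
            seq_lines.append(line)
    flush(headers, seq_lines)  # finish the last record
    return "".join(item + "\n" for item in out)


if __name__ == "__main__":
    example = ">a.fasta >b.fasta\nLYPASG\n"
    print(split_merged_headers(example), end="")
```

Note that this multiplies identical sequences, so the result is no longer non-redundant.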

2.8 years ago by cpad0112 21k

0

2.8 years ago

colindaven 7.0k

I would be a little worried about the multiple ">" across the header.

I used to deal with a lot of problems caused by external FASTA files breaking internal BLAST databases or downstream tools, so I wrote Biopython scripts to replace all special characters in the headers, delete sequences that are too short (e.g. empty ones), etc.

I can provide these if you are interested.
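Those scripts were not posted in the thread. As a rough illustration of that kind of cleanup (not colindaven's code, and written with the standard library rather than Biopython), the sketch below replaces every run of characters outside an assumed safe set with an underscore and drops records shorter than a hypothetical `min_length`. The function name `clean_fasta` and the choice of safe characters are assumptions.

```python
import re


def clean_fasta(fasta_text, min_length=1):
    """Sanitize headers and drop records shorter than min_length."""
    records = []
    header, seq = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))

    out = []
    for header, sequence in records:
        if len(sequence) < min_length:
            continue  # drop empty or too-short sequences
        # Assumed safe set: letters, digits, underscore, dot, hyphen.
        safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", header).strip("_")
        out.append(f">{safe}\n{sequence}\n")
    return "".join(out)


if __name__ == "__main__":
    text = ">5XOU_B Chain B, kinase >5XQ2_B\nLYPASG\n>empty\n"
    print(clean_fasta(text), end="")
```

The sequence of each record is written on a single line, and the merged IDs end up joined into one long identifier, so this is only appropriate when the downstream tool needs a single safe name per sequence.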


score 3 · Accepted Answer · 2022-03-10

There is nothing unusual about this file other than its long header line. All those PDB entries (5XOU_B, 5XQ2_B, etc.) are the same protein, crystallized either in a different crystal group or with different ligands. The nr database consolidates them: the annotations of every entry are retained in the header, but a single sequence represents all of them.

The only potential problem would be if the header actually stretched across 10 or so lines in the file itself. I think it is still a single line that merely looks like many because the text editor wraps it on screen.
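One way to check this without relying on how an editor displays the file is to count the lines that start with ">" and see how many entries each one carries. The sketch below is a hedged example: it assumes merged entries are separated by " >" or by Ctrl-A, and the name `summarize_headers` is hypothetical.

```python
import re

# Assumption: merged deflines are separated by " >" or by Ctrl-A (\x01).
_SEPARATOR = re.compile(r"\x01| >")


def summarize_headers(fasta_text):
    """Return (first_id, number_of_merged_entries, line_length) per header."""
    summary = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines are ignored
        parts = _SEPARATOR.split(line[1:])
        entries = [p.strip() for p in parts if p.strip()]
        first_id = entries[0].split()[0] if entries else ""
        summary.append((first_id, len(entries), len(line)))
    return summary


if __name__ == "__main__":
    with open("fasta.fa") as handle:  # file name from the question
        for first_id, n_entries, length in summarize_headers(handle.read()):
            print(first_id, n_entries, length, sep="\t")
```

If the number of rows equals the number of accessions that returned a sequence, each header is one physical line, however long it looks on screen.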