Strange fasta file
2.7 years ago

I got this FASTA file when I tried to retrieve protein sequences with the blastdbcmd tool from the BLAST+ suite, using a list of accessions as the query against a local nr database.

Command: blastdbcmd -db nr -dbtype prot -entry_batch list_of_accessions -outfmt %f -out fasta.fa

Will working with this file cause problems? Have you ever seen this format?

this

fasta blastdbcmd • 818 views

Maybe you need a script that parses such a FASTA file, splits each merged header into separate headers, and copies the sequence under each of them.

Example fasta:

>a.fasta >b.fasta
LYPASG

Resultant fasta:

>a.fasta
LYPASG
>b.fasta
LYPASG
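
The split shown above could be sketched in plain Python as follows. This is only a minimal sketch: it assumes the merged entries are separated by " >" (a space followed by ">"), as in the example, and the function names read_fasta and split_merged_headers are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header without the leading '>', sequence) for each record."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_merged_headers(
    lines: Iterable[str], separator: str = " >"
) -> Iterator[Tuple[str, str]]:
    """Yield one (header, sequence) pair per entry of each merged header.

    Assumption: entries inside a header are joined by `separator`.
    """
    for header, seq in read_fasta(lines):
        for part in header.split(separator):
            part = part.strip()
            if part:
                yield part, seq


if __name__ == "__main__":
    example = [">a.fasta >b.fasta", "LYPASG"]
    for name, seq in split_merged_headers(example):
        print(f">{name}")
        print(seq)
```

To run it on a real file, pass an open file handle instead of the example list, e.g. split_merged_headers(open("fasta.fa")).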
2.7 years ago
Mensur Dlakic ★ 28k

There is nothing unusual about this file other than its long header line. All those PDB entries (5XOU_B, 5XQ2_B, etc.) are the same protein, crystallized either in different crystal forms or with different ligands. Identical sequences are consolidated in the nr database: the annotation of every entry is retained in the header, but only one sequence represents all of them.

The only potential problem would be if the header actually stretched across 10 or so lines in the file itself. I think it is still a single line that merely looks like many because of text wrapping when you open the file in a text editor.
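
One rough way to check whether a header really is a single line is to look for lines that are neither headers nor plain residue lines. The sketch below is a heuristic only: the function name suspicious_lines and the allowed residue characters (letters, "*" and "-") are assumptions, and a wrapped header fragment made up purely of letters would slip through.

```python
import re
from typing import Iterable, List

# Assumption: sequence lines contain only letters, '*' (stop) or '-' (gap).
RESIDUE_LINE = re.compile(r"^[A-Za-z*\-]+$")


def suspicious_lines(lines: Iterable[str]) -> List[int]:
    """Return 1-based numbers of lines that are neither headers,
    blank lines, nor plain residue lines."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith(">"):
            continue
        if not RESIDUE_LINE.match(line):
            bad.append(number)
    return bad


if __name__ == "__main__":
    # Made-up records for illustration, not real nr entries.
    ok = [">id1 some protein >id2 some protein", "LYPASG"]
    wrapped = [">id1 some protein", "continued header text, part 2", "LYPASG"]
    print(suspicious_lines(ok))
    print(suspicious_lines(wrapped))
```

An empty list suggests that every header sits on one physical line and the apparent line breaks come from the editor's text wrapping.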

Thank you!

2.7 years ago

I would be a little worried about the multiple ">" across the header.

I used to deal with a lot of problems caused by external FASTA files breaking internal BLAST databases or downstream tools, and wrote Biopython scripts to replace all special characters in the headers, delete short sequences (e.g. empty ones), etc.

I can provide these if you are interested.
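
For illustration, a clean-up of that kind might look roughly like this. It is not the poster's script: it uses plain Python instead of Biopython to stay self-contained, and the function name clean_records, the set of characters kept in headers, and the min_length threshold are all assumptions.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: keep letters, digits, '_', '.' and '-'; everything else
# (spaces, '>', commas, brackets, ...) counts as a special character.
UNSAFE = re.compile(r"[^A-Za-z0-9_.-]+")


def clean_records(
    records: Iterable[Tuple[str, str]], min_length: int = 1
) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs with sanitized headers,
    skipping sequences shorter than min_length."""
    for header, seq in records:
        if len(seq) < min_length:
            continue  # drops empty sequences when min_length is 1
        # Each run of unsafe characters becomes a single underscore.
        safe = UNSAFE.sub("_", header).strip("_")
        yield safe, seq


if __name__ == "__main__":
    records = [("5XOU_B Chain B >5XQ2_B", "LYPASG"), ("empty", "")]
    for header, seq in clean_records(records):
        print(f">{header}")
        print(seq)
```

Note that sanitizing a merged header this way collapses all its entries into one long identifier, so splitting the header first may be preferable if the individual accessions matter.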
