Question

FASTA Headers Not Useful from Database Download

7.9 years ago

mollysil ▴ 40

Hi,

I am trying to BLAST many reads against the MaarjAM database (http://maarjam.botany.ut.ee/?action=sTax), a database strictly for arbuscular mycorrhizal fungi sequences. I was able to download the FASTA sequences from the database into a text file and then converted that into a blast-able database using "makeblastdb -in MaarjAM_18s_seq.txt -out MaarjAMdb -dbtype nucl". The main problem is that the headers in the original file are not useful. They look like this:

>gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166
GGGACATCATGTCGGTCGTGCCTCGGTACGTACTGGTATTGTTGGTTTCTCCCTTCTGACGAACCATGATGTCATTTATT
TGGTGTTGTGGGGAATCAGGACTGTTACTTTGAAAA
>gb|LN620567_2015_Davison,_J._sp._VTX00311
AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGGTC
AGTAGATTGGTCGTGCCACTGGTACGTACTGGTCTTACTGATTCCTCCCTCCTGATGAACTGTAATGCCATTAAT

The headers list out the publication information for the sequence instead of having useful information like the Genus and species names (which is what I need it to say!). I use Blast+ for BLASTing against the NCBI database and it works fine. But using Blast+ for this database is not giving me taxon assignments. Instead, I get alignments with no assignments. Any ideas for fixing this problem easily, without much in the way of new software or package downloads? I am using Linux and I have been using MEGAN to import the blast files. I need to change the headers to provide taxonomic information OR figure out another way to get proper taxon assignments!

Thanks! Molly

fungi AMF MaarjAM headers Blast+ • 2.3k views

ADD COMMENT • link updated 7.9 years ago by wjidea ▴ 50 • written 7.9 years ago by mollysil ▴ 40

score 1 · Answer 1 · 2016-12-16

7.9 years ago

wjidea ▴ 50

It seems like the last part of the sequence header could lead you to a taxid in NCBI GenBank.

My solution: parse fasta -> last part of your header (VTX00166) -> search entrez (e.g., API in biopython) -> get taxon id -> translate taxon id using taxdump -> get taxonomy info -> modify original fasta file
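The "translate taxon id using taxdump" step can be sketched roughly as below. This is a minimal sketch, assuming the `names.dmp` file from the NCBI taxdump archive, where fields are separated by tab-pipe-tab and the fourth field is the name class. The function name `load_scientific_names` is made up for illustration, and the two demo rows mimic the root entry of that file.

```python
import io


def load_scientific_names(handle):
    """Map taxid (int) -> scientific name from a names.dmp-style stream."""
    names = {}
    for line in handle:
        # Each row looks like: taxid <tab>|<tab> name <tab>|<tab> unique name <tab>|<tab> class <tab>|
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        if len(fields) < 4:
            continue  # skip malformed or empty rows
        taxid, name, _unique_name, name_class = fields[:4]
        # Keep only the primary name; synonyms and common names are ignored.
        if name_class == "scientific name":
            names[int(taxid)] = name
    return names


# Tiny in-memory stand-in for names.dmp (assumed layout, see lead-in).
demo = io.StringIO(
    "1\t|\tall\t|\t\t|\tsynonym\t|\n"
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
)
names = load_scientific_names(demo)
print(names)
```

For the real file, pass `open("names.dmp")` instead of the in-memory demo. The lineage (genus, family, and so on) would additionally need `nodes.dmp`, which is not covered here.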

Hope it helps.

Edit1:

If you have a large sequence file to query, consider downloading the GI-to-taxid mapping from ftp://ftp.ncbi.nih.gov/pub/taxonomy/. You will need to parse and query the results on your local machine.
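A local lookup could look something like the sketch below. It assumes an accession-to-taxid table with a header row and tab-separated columns `accession`, `accession.version`, `taxid`, `gi`, which is the layout I believe NCBI uses for its accession2taxid files; check the README in the FTP directory before relying on it. The function name `load_accession_taxids` and the taxid values in the demo are invented for illustration only.

```python
import csv
import io


def load_accession_taxids(handle, wanted):
    """Return {accession: taxid} for the accessions listed in `wanted`."""
    wanted = set(wanted)
    found = {}
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 3:
            continue  # skip short or empty rows
        accession, _version, taxid = row[:3]
        # Only keep rows we asked for, so memory use stays small on huge tables.
        if accession in wanted:
            found[accession] = int(taxid)
    return found


# In-memory stand-in for the mapping file; the taxids 111 and 222 are made up.
demo = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "AB046938\tAB046938.1\t111\t0\n"
    "AB046939\tAB046939.1\t222\t0\n"
)
taxids = load_accession_taxids(demo, ["AB046939"])
print(taxids)
```

Streaming the table once and filtering by the accessions in your FASTA file avoids loading the whole mapping into memory.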

ADD COMMENT • link 7.9 years ago by wjidea ▴ 50

It's actually the first part of the sequence ID that represents the NCBI sequence accession number. So how would I go about matching up the first part of the headers with the NCBI database taxon information? I am able to remove the latter part of the header so it looks like:

>gb|AB046938
TGAAACTGCTAATGGCTCATTAA

>gb|AB046939
TGAAACTGCTAGGGGCTCATTAA

Do you know of any scripts that link these sequence IDs to the NCBI taxon information and then rewrite the headers to reflect it?

ADD REPLY • link 7.9 years ago by mollysil ▴ 40

Here is one Biopython solution:

from Bio import Entrez

Entrez.email = 'your@email.com'  # tell NCBI who you are

# fetch the GenBank flat-file record for one accession
handle = Entrez.efetch(db="nucleotide", id="AB046939", rettype="gb", retmode="text")
result = handle.read().split('\n')
handle.close()

for line in result:
    # organism name (genus and species)
    if 'ORGANISM' in line:
        print(' '.join(line.split()[1:]))

    # if you want the taxid, taken from the /db_xref="taxon:..." qualifier
    if 'taxon:' in line:
        print(line.split('taxon:')[1].rstrip('"'))
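Once each accession has an organism name, the remaining step is rewriting the FASTA headers. The sketch below is one possible way to do that, assuming every header starts with `>gb|ACCESSION_` as in the two MaarjAM examples in the question. The function name `rewrite_headers` and the organism string in the demo are placeholders, not real lookups.

```python
import re

# Assumed header shape, based only on the two examples in the question:
# >gb|<accession>_<year>_<author>_<label>_<VTX id>
HEADER_RE = re.compile(r"^>gb\|([A-Za-z]+\d+)_")


def rewrite_headers(lines, organism_by_accession):
    """Yield FASTA lines with headers replaced by '>accession organism'."""
    for line in lines:
        line = line.rstrip("\n")
        match = HEADER_RE.match(line)
        organism = organism_by_accession.get(match.group(1)) if match else None
        if organism:
            # Keep the accession first so hits can still be traced to GenBank.
            yield ">{} {}".format(match.group(1), organism)
        else:
            # Sequence lines and headers without a known organism pass through.
            yield line


fasta = [
    ">gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166",
    "GGGACATCATGTCGG",
    ">gb|LN620567_2015_Davison,_J._sp._VTX00311",
    "AGCAGCCGCGGTAAT",
]
# Placeholder mapping; in practice it would be built from the Entrez output.
lookup = {"AB076274": "Example organism"}
rewritten = list(rewrite_headers(fasta, lookup))
print("\n".join(rewritten))
```

Headers with no entry in the mapping are left unchanged, so a partial lookup never drops sequences. The rewritten file would then need to go through `makeblastdb` again.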