Hi,
I am trying to BLAST many reads against the MaarjAM database (http://maarjam.botany.ut.ee/?action=sTax), a database strictly for arbuscular mycorrhizal fungi sequences. I downloaded the FASTA sequences from the database into a text file and converted it into a BLAST database using "makeblastdb -in MaarjAM_18s_seq.txt -out MaarjAMdb -dbtype nucl". The main problem is that the headers in the original file are not useful. They look like this:
>gb|AB076274_2004_Saito,_M._GlAc2.1_VTX00166
GGGACATCATGTCGGTCGTGCCTCGGTACGTACTGGTATTGTTGGTTTCTCCCTTCTGACGAACCATGATGTCATTTATT
TGGTGTTGTGGGGAATCAGGACTGTTACTTTGAAAA
>gb|LN620567_2015_Davison,_J._sp._VTX00311
AGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGAATTTCGGGGTC
AGTAGATTGGTCGTGCCACTGGTACGTACTGGTCTTACTGATTCCTCCCTCCTGATGAACTGTAATGCCATTAAT
The headers list publication information for each sequence instead of useful information like the genus and species names (which is what I need them to say!). I use BLAST+ against the NCBI database and it works fine, but against this database it does not give me taxon assignments: I get alignments with no assignments. Any ideas for fixing this easily, without many new software or package downloads? I am using Linux and I have been using MEGAN to import the BLAST output files. I need to either change the headers to provide taxonomic information or figure out another way to get proper taxon assignments!
Thanks! Molly
It's actually the first part of the sequence ID that represents the NCBI sequence accession number. So how would I go about matching up the first part of the headers with the NCBI database taxon information? I am able to remove the latter part of the header so it looks like:
Do you know of any scripts that map these sequence IDs to the NCBI taxon information and then change the headers to reflect it?
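One way this could be scripted, using only the Python standard library: pull the accession off the front of each header, look it up in a two-column table (accession, tab, taxon name), and rewrite the header. This is a rough sketch, not a tested pipeline. It assumes Python 3 is available, that the accession is always the first underscore-delimited field after "gb|" (as in the two headers shown above), and that you have already built the lookup table somehow. The function names and the table layout are made up for illustration.

```python
import re

# Accession at the start of a MaarjAM-style ID, e.g. "gb|AB076274_2004_...".
# Assumption: letters followed by digits, with an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^(?:gb\|)?([A-Za-z]{1,6}\d{5,8})(?:\.\d+)?")


def accession_from_header(header):
    """Return the accession from a FASTA header line, or None if not found."""
    match = ACCESSION_RE.match(header.lstrip(">").strip())
    return match.group(1) if match else None


def load_mapping(path):
    """Read a tab-delimited file of 'accession<TAB>taxon name' into a dict."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[0]:
                mapping[fields[0]] = fields[1]
    return mapping


def relabel_fasta(lines, acc_to_taxon):
    """Yield FASTA lines with mapped headers rewritten as '>accession taxon'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            acc = accession_from_header(line)
            taxon = acc_to_taxon.get(acc)
            if acc and taxon:
                yield ">{} {}".format(acc, taxon)
            else:
                yield line  # no mapping found: keep the original header
        else:
            yield line  # sequence line: pass through unchanged


def relabel_file(in_path, map_path, out_path):
    """Rewrite the headers of in_path into out_path using the table at map_path."""
    mapping = load_mapping(map_path)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in relabel_fasta(src, mapping):
            dst.write(line + "\n")
```

For example, relabel_file("MaarjAM_18s_seq.txt", "acc2taxon.tsv", "MaarjAM_relabelled.fasta") would write a new FASTA file (the second and third file names are placeholders), which you could then pass to makeblastdb as before. Headers with no entry in the table are left as they were, so you can see which accessions still need a taxon.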
Here is one Biopython solution:
I cannot get Biopython. But thank you anyway!
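If Biopython is not an option, the accession-to-organism lookup can probably be done with the Python standard library alone by querying NCBI's E-utilities efetch service directly. The sketch below is untested against the full MaarjAM set. It assumes Python 3, network access, and that the GenBank XML returned by efetch uses the GBSeq_primary-accession and GBSeq_organism elements. The function names are invented for illustration. NCBI limits how fast you may send requests, so send the accessions in modest batches with a pause between calls.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_organisms(xml_text):
    """Map accession -> organism name from GenBank XML text.

    Assumption: each record is a <GBSeq> element holding
    <GBSeq_primary-accession> and <GBSeq_organism> children.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for record in root.iter("GBSeq"):
        accession = record.findtext("GBSeq_primary-accession")
        organism = record.findtext("GBSeq_organism")
        if accession and organism:
            result[accession] = organism
    return result


def fetch_organisms(accessions):
    """Ask NCBI for one small batch of accessions (needs network access)."""
    params = urllib.parse.urlencode({
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "gb",
        "retmode": "xml",
    })
    with urllib.request.urlopen(EFETCH_URL + "?" + params) as response:
        return parse_organisms(response.read().decode("utf-8"))
```

Calling fetch_organisms(["AB076274", "LN620567"]) should return a dictionary keyed by accession, which you could write out as a tab-delimited table and use to rewrite the FASTA headers. Parsing is kept separate from the download so it can be checked offline against a saved XML file.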