Lets have 200k genomic contigs with some (unknown) bacterial contamination.
I blasted (blastn vs nr) all of them, got tabulated output and passed the uniq acc nos ca 5k to Batch Entrez. Since neither my target genome nor bacterias causing contamination are not sequenced, I got a shotgun of results (3000 Eukaryota, 2000 Bacteria, few viruses).
Now for a tricky part: what I need is: sequence_identifier + taxonomic_id(s) + main_tax_group
something along the line:
A000001 573 Bacteria
Apart from writing a script storing the sequence & taxonomy info into say MySQL, then going through blast top hits output, are there any tools (taverna work flows?) which can do it for me?
re Pierre
Primary input is text blast output of:
blastcl3 -p blastn -m 9 -e 0.00001 -b 1 -i frag01 -o out_blastn_frag01
I grep-ed and awk-ed hit acc numbers from second column. Resulting text file (one acc no per line) was feed to Batch Entrez. As far as I can tell there is no way of selecting output in form: A000001 573 Bacteria The most parsable output seems to be TinyXML, but then I will download full bacterial genomes / eukaryotic chromosomes worth of sequence which at this stage I do not need.
Ideally instead of two extremes (E.coli K12 + Bacteria) getting a whole taxonomic path:
cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli
will be preferred. That way one can zoom in (select more than just species/strain and taxonomic Kingdom).
So at this moment I am split between using (1) just blast tabulated text output or selecting some Batch Entrez output which then I will be able to combine with (1).
re giovanni single line which gets squezzed a bit here:
contig62836 gi|119525916|gb|CP000508.1| 93.18 44 3 0 1109 1152 262350 262393 2e-06 63.9
Before each of the top hits there is blast header with hash sign in front:
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
So simple:
grep -A 1 Fields out_blastn_frag0* | grep contig | awk '{ print $2}' | awk 'FS="|" {print $4}' | sort | uniq > all_uniq_hits_100302.txt
gives me list off unique accession numbers of my top hits suitable for Batch Entrez.
re XML: yes, but I tried to avoid too much network traffic. XML for half a million contigs is a lot of data. save for oneliners I am using python.
hum, not sure I understand what is your input... An example ?
Can you also show an example of the table output from blast? Anyway it is better to use the xml output as it is more stable over time. Also, are you doing this in any particular programming language or tool?