Taxonomy Of Blast Hits


15.4 years ago

Darked89 4.7k

Let's say we have 200k genomic contigs with some (unknown) bacterial contamination.

I blasted all of them (blastn vs nr), got tabular output, and passed the unique accession numbers (ca. 5k) to Batch Entrez. Since neither my target genome nor the bacteria causing the contamination have been sequenced, I got a scattershot of results (3000 Eukaryota, 2000 Bacteria, a few viruses).

Now for the tricky part. What I need is: sequence_identifier + taxonomic_id(s) + main_tax_group

Something along the lines of:

A000001 573 Bacteria

Apart from writing a script that stores the sequence and taxonomy info in, say, MySQL and then goes through the BLAST top-hit output, are there any tools (Taverna workflows?) that can do this for me?

re Pierre

Primary input is text blast output of:

blastcl3 -p blastn -m 9 -e 0.00001 -b 1 -i frag01 -o out_blastn_frag01

I grep-ed and awk-ed the hit accession numbers from the second column. The resulting text file (one accession number per line) was fed to Batch Entrez. As far as I can tell, there is no way to select output of the form: A000001 573 Bacteria. The most parsable output seems to be TinySeq XML, but then I would download full bacterial genomes / eukaryotic chromosomes worth of sequence, which I do not need at this stage.

Ideally, instead of just the two extremes (E. coli K12 + Bacteria), I would prefer to get the whole taxonomic path:

cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia coli

That way one can zoom in (select more than just the species/strain and the taxonomic kingdom).

So at the moment I am torn between (1) using just the tabular BLAST text output, or (2) selecting some Batch Entrez output that I can then combine with (1).

re Giovanni: a single line, which gets squeezed a bit here:

contig62836  gi|119525916|gb|CP000508.1|     93.18   44      3       0       1109    1152    262350  262393  2e-06   63.9

Before each of the top hits there is blast header with hash sign in front:

# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score

So a simple:

grep -A 1 Fields out_blastn_frag0* | grep contig | awk '{print $2}' | awk -F'|' '{print $4}' | sort -u > all_uniq_hits_100302.txt

gives me a list of the unique accession numbers of my top hits, suitable for Batch Entrez. (The field separator is set with -F so that it already applies to the first line.)
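Since the rest of my scripting is in Python, the same extraction can be sketched there. This is only a minimal sketch under assumptions: the function name `top_hit_accessions` is made up for illustration, and it assumes 12-column tabular lines with subject ids of the form `gi|<gi>|<db>|<accession>|`, as in the example line above.

```python
def top_hit_accessions(lines):
    """Return {query_id: accession} for the first hit listed per query.

    Assumes BLAST tabular output: comment lines start with '#', data
    lines are whitespace-separated with the subject id in column 2.
    """
    top = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '# Fields: ...' headers
        fields = line.split()
        if len(fields) < 12:
            continue  # not a complete hit line
        query, subject = fields[0], fields[1]
        if query in top:
            continue  # keep only the first (top) hit per query
        parts = subject.split("|")
        # gi|<gi>|<db>|<accession.version>| -> accession is the 4th field
        top[query] = parts[3] if len(parts) >= 4 else subject
    return top


if __name__ == "__main__":
    example = [
        "# Fields: Query id, Subject id, % identity, alignment length",
        "contig62836\tgi|119525916|gb|CP000508.1|\t93.18\t44\t3\t0"
        "\t1109\t1152\t262350\t262393\t2e-06\t63.9",
    ]
    hits = top_hit_accessions(example)
    # One accession per line, as Batch Entrez expects.
    for accession in sorted(set(hits.values())):
        print(accession)
```

Keeping the query-to-accession mapping (rather than only the unique list) makes it easier to join the taxonomy back to the contigs later.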

re XML: yes, but I tried to avoid too much network traffic. XML for half a million contigs is a lot of data. Apart from one-liners, I am using Python.

blast taxonomy • 15k views

ADD COMMENT • link updated 21 months ago by Ram 45k • written 15.4 years ago by Darked89 4.7k

Hmm, not sure I understand what your input is... An example?

ADD REPLY • link 15.4 years ago by Pierre Lindenbaum 166k

Can you also show an example of the table output from BLAST? Anyway, it is better to use the XML output, as it is more stable over time. Also, are you doing this in any particular programming language or tool?

ADD REPLY • link 15.4 years ago by Giovanni M Dall'Olio 28k


15.4 years ago

Pierre Lindenbaum 166k

If you want to use the TinySeq XML without keeping the sequence, I would create a SAX parser that only reads the value of the taxon ID and ignores the other fields (see "class TinySeqHandler" in http://code.google.com/p/lindenb/source/browse/trunk/proj/tinytools/src/org/lindenb/tinytools/TwitterOmics.java for an example). Once you have the taxon ID, you can get the full lineage from

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=YOUR_TAXON_ID&retmode=xml
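For the Python side, pulling the lineage out of that taxonomy response could look roughly like the sketch below. It relies on the same elements the Java code below reads (`Taxon`, `TaxId`, `ScientificName`, `LineageEx`). The `SAMPLE` string is a hand-written, heavily abbreviated stand-in for a real response, and `lineage_from_xml` is a name chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated stand-in for an efetch taxonomy response.
SAMPLE = """<TaxaSet>
  <Taxon>
    <TaxId>573</TaxId>
    <ScientificName>Klebsiella pneumoniae</ScientificName>
    <LineageEx>
      <Taxon>
        <TaxId>131567</TaxId>
        <ScientificName>cellular organisms</ScientificName>
      </Taxon>
      <Taxon>
        <TaxId>2</TaxId>
        <ScientificName>Bacteria</ScientificName>
      </Taxon>
    </LineageEx>
  </Taxon>
</TaxaSet>"""


def lineage_from_xml(xml_text):
    """Return [(taxid, name), ...] from the root of the tree down to the taxon."""
    root = ET.fromstring(xml_text)
    taxon = root.find("Taxon")
    if taxon is None:
        raise ValueError("no Taxon element in response")
    path = []
    lineage = taxon.find("LineageEx")
    if lineage is not None:
        # Ancestors are listed from the most general to the most specific.
        for ancestor in lineage.findall("Taxon"):
            path.append((int(ancestor.findtext("TaxId")),
                         ancestor.findtext("ScientificName")))
    # The queried taxon itself goes last.
    path.append((int(taxon.findtext("TaxId")),
                 taxon.findtext("ScientificName")))
    return path


if __name__ == "__main__":
    print("; ".join(name for _taxid, name in lineage_from_xml(SAMPLE)))
```

Caching the result per taxon ID (a plain dict would do for a few thousand IDs) avoids asking for the same lineage twice.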

	package org.lindenb.acn2taxonomy;

	import java.io.BufferedReader;
	import java.io.File;
	import java.io.IOException;
	import java.io.InputStream;
	import java.io.InputStreamReader;
	import java.net.URL;
	import java.net.URLConnection;
	import java.util.ArrayList;
	import java.util.List;
	import java.util.logging.Level;
	import java.util.logging.Logger;
	import java.util.regex.Pattern;

	import javax.xml.parsers.DocumentBuilder;
	import javax.xml.parsers.DocumentBuilderFactory;
	import javax.xml.parsers.SAXParser;
	import javax.xml.parsers.SAXParserFactory;

	import org.lindenb.berkeley.db.PrimaryDB;
	import org.lindenb.io.IOUtils;
	import org.lindenb.me.Me;
	import org.lindenb.util.C;
	import org.lindenb.util.Compilation;
	import org.lindenb.util.StringUtils;
	import org.lindenb.xml.XMLUtilities;
	import org.w3c.dom.Document;
	import org.w3c.dom.Element;
	import org.xml.sax.Attributes;
	import org.xml.sax.InputSource;
	import org.xml.sax.SAXException;
	import org.xml.sax.helpers.DefaultHandler;

	import com.sleepycat.bind.tuple.IntegerBinding;
	import com.sleepycat.bind.tuple.TupleBinding;
	import com.sleepycat.bind.tuple.TupleInput;
	import com.sleepycat.bind.tuple.TupleOutput;
	import com.sleepycat.je.DatabaseConfig;
	import com.sleepycat.je.Environment;
	import com.sleepycat.je.EnvironmentConfig;

	public class AcnToTaxonomy
	{
	private static final Logger LOG=Logger.getLogger("org.lindenb");
	private File baseDir=new File(System.getProperty("java.io.tmpdir"));
	private File dbHome=null;
	private Environment environment=null;
	private PrimaryDB<Integer, TaxonNode> id2taxon=null;
	private DocumentBuilder docBuilder;
	private long sleep_time=100;

	private static class TaxonNode
	{
	int id;
	String name="";
	int parent_id=-1;
	}

	private static class TaxonBinding
	extends TupleBinding<TaxonNode>
	{
	@Override
	public TaxonNode entryToObject(TupleInput in)
	{
	TaxonNode n=new TaxonNode();
	n.id=in.readInt();
	n.name=in.readString();
	n.parent_id=in.readInt();
	return n;
	}
	@Override
	public void objectToEntry(TaxonNode node, TupleOutput out)
	{
	out.writeInt(node.id);
	out.writeString(node.name);
	out.writeInt(node.parent_id);
	}
	}

	private class TinyXmlHandler
	extends DefaultHandler
	{
	private StringBuilder text=null;
	private int TSeq_taxid=-1;
	private String TSeq_defline=null;
	private String error=null;
	TinyXmlHandler(String acn)
	{

	}
	@Override
	public void startElement(String uri, String localName, String name,
	Attributes attributes) throws SAXException
	{
	text=null;
	if(StringUtils.isIn(name,"TSeq_taxid","TSeq_defline","Error"))
	{
	this.text=new StringBuilder();
	}
	}
	@Override
	public void endElement(String uri, String localName, String name) throws SAXException
	{
	if(name.equals("TSeq_taxid")) { this.TSeq_taxid= Integer.parseInt(this.text.toString());}
	else if(name.equals("TSeq_defline")) { this.TSeq_defline= this.text.toString();}
	else if(name.equals("Error")) { this.error= this.text.toString();}
	text=null;
	}
	@Override
	public void characters(char[] ch, int start, int length)
	throws SAXException {
	if(this.text!=null) text.append(ch, start, length);
	}
	}

	private AcnToTaxonomy()
	throws Exception
	{
	DocumentBuilderFactory f=DocumentBuilderFactory.newInstance();
	f.setCoalescing(true);
	f.setNamespaceAware(false);
	f.setValidating(false);
	f.setExpandEntityReferences(true);
	f.setIgnoringComments(true);
	f.setIgnoringElementContentWhitespace(true);
	this.docBuilder= f.newDocumentBuilder();
	}


	private void open() throws IOException
	{
	this.dbHome=IOUtils.createTempDir(this.baseDir);
	LOG.info("created "+this.dbHome);
	EnvironmentConfig envConfig= new EnvironmentConfig();
	envConfig.setAllowCreate(true);
	envConfig.setReadOnly(false);
	this.environment= new Environment(dbHome, envConfig);
	LOG.info("opened bdbd env");
	DatabaseConfig dbConfig=new DatabaseConfig();
	dbConfig.setAllowCreate(true);
	dbConfig.setReadOnly(false);
	this.id2taxon=new PrimaryDB<Integer, TaxonNode>(this.environment, null, "id2taxon", dbConfig, new IntegerBinding(), new TaxonBinding());
	}

	private void close()
	{
	if(this.id2taxon!=null)
	{
	LOG.info("closing database");
	this.id2taxon.close();
	this.id2taxon=null;
	}
	if(this.environment!=null)
	{
	LOG.info("closing bdbd env");
	this.environment.close();
	this.environment=null;
	}
	if(this.dbHome!=null)
	{
	for(File f: this.dbHome.listFiles())
	{
	f.delete();
	}
	this.dbHome.delete();
	this.dbHome=null;
	}
	}
	private InputStream openURL(URL url)throws IOException
	{
	final int max_try=10;
	for(int try_count=0;try_count<max_try;++try_count)
	{
	InputStream is=null;
	try
	{
	URLConnection con=url.openConnection();
	con.setConnectTimeout(10*1000);
	is=con.getInputStream();
	return is;
	}
	catch(Exception err)
	{
	System.err.println("Cannot open "+url+" trying... "+(try_count+1)+"/"+max_try);
	try
	{
	Thread.sleep(10*1000);
	}
	catch (InterruptedException e)
	{
	}
	}
	}
	throw new IOException("Cannot open "+url);
	}



	private StringBuilder taxopath(int taxonid,StringBuilder str) throws Exception
	{
	TaxonNode node=this.id2taxon.get(null, taxonid);
	if(node==null)
	{
	String url="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id="+taxonid+"&retmode=xml&tool=acn2tax&email=plindenbaum_at_yahoo_fr";
	InputStream in=openURL(new URL(url));
	Document dom= this.docBuilder.parse(new InputSource(in));
	in.close();
	Element root=dom.getDocumentElement();
	Element Taxon=XMLUtilities.one(root, "Taxon");

	node=new TaxonNode();

	Element TaxId=XMLUtilities.one(Taxon, "TaxId");
	Element ScientificName=XMLUtilities.one(Taxon, "ScientificName");
	node.id= Integer.parseInt(TaxId.getTextContent());
	node.name= ScientificName.getTextContent();

	Element LineageEx=XMLUtilities.one(Taxon, "LineageEx");
	List<Element> taxons= XMLUtilities.elements(LineageEx, "Taxon");

	List<TaxonNode> nodes= new ArrayList<TaxonNode>(taxons.size());
	for(Element e: taxons)
	{
	TaxId=XMLUtilities.one(e, "TaxId");
	ScientificName=XMLUtilities.one(e, "ScientificName");
	TaxonNode newnode= new TaxonNode();
	newnode.id= Integer.parseInt(TaxId.getTextContent());
	newnode.name= ScientificName.getTextContent();
	nodes.add(newnode);
	}
	nodes.add(node);

	for(int i=1;i< nodes.size();i++)
	{
	nodes.get(i).parent_id=nodes.get(i-1).id;
	if(!this.id2taxon.containsKey(null,nodes.get(i).id))
	{
	this.id2taxon.put(null,nodes.get(i).id,nodes.get(i));
	}
	}
	}
	else
	{
	str.insert(0,"\""+C.escape(node.name)+"\"("+node.id+")"+(str.length()==0?"":" > "));
	}
	if(node.parent_id>0)
	{
	taxopath(node.parent_id,str);
	}
	return str;
	}



	private void run(BufferedReader in) throws Exception
	{
	SAXParserFactory f= SAXParserFactory.newInstance();
	f.setNamespaceAware(false);
	f.setValidating(false);
	SAXParser parser=f.newSAXParser();
	Pattern pattern=Pattern.compile("[a-z][a-z_0-9]+(\\.[0-9]+)?",Pattern.CASE_INSENSITIVE);
	String line;
	while((line=in.readLine())!=null)
	{
	if(line.startsWith("#")) continue;
	line=line.trim();
	if(line.isEmpty()) continue;
	if(!pattern.matcher(line).matches())
	{
	System.err.println("Invalid acn "+line+" does not match "+pattern.pattern());
	continue;
	}
	String api_url="http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id="+
	line+
	"&rettype=fasta&retmode=xml&tool=acn2tax&email=plindenbaum_at_yahoo_fr"
	;
	LOG.info(api_url);

	URL url=new URL(api_url);
	InputStream is=openURL(url);
	TinyXmlHandler handler=new TinyXmlHandler(line);
	parser.parse(is, handler);
	is.close();
	if(handler.error!=null)
	{
	System.err.println("#Error: cannot get "+line+" : "+handler.error);
	}
	else
	{
	StringBuilder taxonpath=taxopath(handler.TSeq_taxid,new StringBuilder());
	System.out.println(line+"\t\""+C.escape(handler.TSeq_defline)+"\"\t"+taxonpath);
	}
	try { Thread.sleep(this.sleep_time);}catch(Exception e2) {}
	}
	}

	public static void main(String[] args)
	{
	AcnToTaxonomy app=null;
	try
	{
	app=new AcnToTaxonomy();
	LOG.setLevel(Level.OFF);
	int optind=0;
	while(optind< args.length)
	{
	if(args[optind].equals("-h") ||
	args[optind].equals("-help") ||
	args[optind].equals("--help"))
	{
	System.err.println(Me.FIRST_NAME+" "+Me.LAST_NAME+" "+Me.MAIL);
	System.err.println(Compilation.getLabel());
	System.err.println("Options:");
	System.err.println(" -b <dir> base directory for bdb files:"+app.baseDir);
	System.err.println(" --log-level <level> one of "+Level.class.getName());
	System.err.println(" -h help; This screen.");
	return;
	}
	else if(args[optind].equals("--log-level"))
	{
	LOG.setLevel(Level.parse(args[++optind]));
	}
	else if(args[optind].equals("-b"))
	{
	// the directory is the argument that follows "-b"
	app.baseDir=new File(args[++optind]);
	if(!app.baseDir.exists())
	{
	System.err.println("File does not exist: "+app.baseDir);
	return;
	}
	if(!app.baseDir.isDirectory())
	{
	System.err.println("File is not a directory: "+app.baseDir);
	return;
	}
	}
	else if(args[optind].equals("--"))
	{
	optind++;
	break;
	}
	else if(args[optind].startsWith("-"))
	{
	System.err.println("Unknown option "+args[optind]);
	return;
	}
	else
	{
	break;
	}
	++optind;
	}
	app.open();
	if(optind==args.length)
	{
	app.run(new BufferedReader(new InputStreamReader(System.in)));
	}
	else
	{
	while(optind< args.length)
	{
	java.io.BufferedReader r= IOUtils.openReader(args[optind++]);
	app.run(r);
	r.close();
	}
	}
	}
	catch(Throwable err)
	{
	err.printStackTrace();
	}
	finally
	{
	if(app!=null) app.close();
	}
	}
	}


	all:
	mkdir -p acn2tax/lib
	mkdir -p acn2tax/tmp
	javac -d acn2tax/tmp -cp /usr/local/package/je-4.0.71/lib/je-4.0.71.jar -sourcepath src:/home/pierre/lindenb/src/java src/org/lindenb/acn2taxonomy/AcnToTaxonomy.java
	jar cvf acn2tax/lib/acn2tax.jar -C acn2tax/tmp .
	-cp /usr/local/package/je-4.0.71/lib/je-4.0.71.jar acn2tax/lib/
	rm -rf acn2tax/tmp
	zip -r acn2tax.zip acn2tax
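For readers working in Python rather than Java, the SAX idea can be sketched with the standard library as well. This is a rough equivalent of the `TinyXmlHandler` above, not a port of the whole program: the class and function names are illustrative, and `SAMPLE` is a hand-written placeholder record, not real NCBI output. The sequence is still sent over the network; the handler just never stores it.

```python
import xml.sax


class TinySeqHandler(xml.sax.ContentHandler):
    """Collect only TSeq_taxid and TSeq_defline; ignore everything else."""

    WANTED = ("TSeq_taxid", "TSeq_defline")

    def __init__(self):
        super().__init__()
        self.taxid = None
        self.defline = None
        self._buf = None  # a list while inside a wanted element, else None

    def startElement(self, name, attrs):
        self._buf = [] if name in self.WANTED else None

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if self._buf is not None:
            text = "".join(self._buf).strip()
            if name == "TSeq_taxid":
                self.taxid = int(text)
            elif name == "TSeq_defline":
                self.defline = text
        self._buf = None


def taxid_and_defline(xml_bytes):
    """Parse one TinySeq document; returns (taxid, defline)."""
    handler = TinySeqHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.taxid, handler.defline


# Hand-written placeholder record, for illustration only.
SAMPLE = b"""<TSeqSet>
  <TSeq>
    <TSeq_taxid>573</TSeq_taxid>
    <TSeq_defline>hypothetical example record</TSeq_defline>
    <TSeq_sequence>ACGTACGT</TSeq_sequence>
  </TSeq>
</TSeqSet>"""

if __name__ == "__main__":
    print(taxid_and_defline(SAMPLE))
```

If a document holds several records, this handler keeps only the last one, so it is meant for one accession per request, as in the Java program.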


ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 15.4 years ago by Pierre Lindenbaum 166k


please don't be shy about continuing your discussion here... If you continue your discussion in private, then it is of no use for the other readers.

ADD REPLY • link 15.4 years ago by Giovanni M Dall'Olio 28k

Pierre's Java program did work as promised. For 6.6k accessions it took ca. 6 hours. Thank you :-). Now it is up to me to combine it with the BLAST output, contig sizes, etc.
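A rough Python sketch of that combining step follows. It assumes the program's output lines look like `accession<TAB>"defline"<TAB>"name"(taxid) > "name"(taxid) ...`, which is how the Java code above assembles them. The choice of "main group" names and the function names are assumptions made for illustration.

```python
import re

# Matches one path entry such as "Bacteria"(2); allows escaped quotes in names.
NODE_RE = re.compile(r'"((?:[^"\\]|\\.)*)"\((\d+)\)')

# Assumed set of top-level groups of interest.
MAIN_GROUPS = ("Bacteria", "Archaea", "Eukaryota", "Viruses")


def parse_path(path_text):
    """Turn '"a"(1) > "b"(2)' into [(1, 'a'), (2, 'b')]."""
    return [(int(taxid), name) for name, taxid in NODE_RE.findall(path_text)]


def summarize(acn_line):
    """Return (accession, taxid of last path entry, main group) for one line."""
    accession, _defline, path_text = acn_line.rstrip("\n").split("\t", 2)
    path = parse_path(path_text)
    group = next((name for _taxid, name in path if name in MAIN_GROUPS),
                 "unknown")
    taxid = path[-1][0] if path else None
    return accession, taxid, group


def annotate(top_hits, acn_lines):
    """Join {contig: accession} with the per-accession taxonomy lines."""
    info = {}
    for line in acn_lines:
        accession, taxid, group = summarize(line)
        info[accession] = (taxid, group)
    for contig, accession in sorted(top_hits.items()):
        taxid, group = info.get(accession, (None, "unknown"))
        yield contig, accession, taxid, group


if __name__ == "__main__":
    line = ('A000001\t"some defline"\t"cellular organisms"(131567)'
            ' > "Bacteria"(2) > "Klebsiella pneumoniae"(573)')
    for row in annotate({"contig1": "A000001"}, [line]):
        print("\t".join(str(x) for x in row))
```

Contigs whose accession is missing from the taxonomy output come back with `None` and `unknown`, which makes them easy to spot and re-query.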

ADD REPLY • link 15.4 years ago by Darked89 4.7k

Dear Pierre, I just sent an email to your Yahoo address. In short, I think it is a great idea. I will look into your code tomorrow. Thank you.

ADD REPLY • link 15.4 years ago by Darked89 4.7k

I put the source code I wrote on gist: http://gist.github.com/320585

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 15.4 years ago by Pierre Lindenbaum 166k

Sorry, but the email was non-technical. It was all about what I can possibly do for Pierre in return for (in some way, at least) doing my homework. It may well be interesting in the longer run to discuss how we return such favors (authorship? $$$? an invitation to give a talk?), but often the person asking the question is not at the helm and cannot promise much. I hope that explains it a bit.

As for Pierre's program, once it stops running and I have checked the output, I will write about it.

ADD REPLY • link 15.4 years ago by Darked89 4.7k


15.4 years ago

Michael 56k

From the description of your input data, I guess that you are trying to do a taxonomic classification of sequences in a metagenomics approach. I further assume that you have about 200,000 reads or sequences (or do you mean assembled contigs of length 200 kb?). I am not sure I completely understand the question, but whatever you do, filtering out the tax IDs with your own script might not be the best option.

I further assume that you wish to compute a tree of the taxonomic composition of the data as a whole.

That way one can zoom in (select more than just species/strain and taxonomic Kingdom)

For this task you might want to try the MEGAN (Metagenome Analysis) software.

Actually, what you are describing looks very much like one of the publications they have in their publications list:

H. N. Poinar, C. Schwarz, Ji Qi, B. Shapiro, R. D. E. MacPhee, B. Buigues, A. Tikhonov, D. H. Huson, L. P. Tomsho, A. Auch, M. Rampp, W. Miller, S. C. Schuster, Metagenomics to Paleogenomics: Large-Scale Sequencing of Mammoth DNA, Science 311:392-394, 2006

There is also a tutorial on setting the right BLAST parameters for use with short reads.

So in principle, this program could do the job or at least you can have a look at the right parameters for blast.

ADD COMMENT • link updated 21 months ago by Ram 45k • written 15.4 years ago by Michael 56k

Thank you. These are genomic contigs. "Metagenomics" is accidental. The input DNA was from two sources, one of which contained bacteria. No idea if these came from a dirty root, or lived between plant cells or within them(?), but it does not look like a bacterial lab strain jumping to a new bottle.
The sequence was produced by 454s and assembled by Newbler. My contigs are anywhere from 100 bp to 1 Mbp. So I am expecting one plant genome and at least one, possibly many, bacterial species. A single, 1 Mb large bacterial contig can hit multiple species of bacteria (blastp using predicted genes, 40-70% similarity hits).

ADD REPLY • link 15.4 years ago by Darked89 4.7k

Many plants undergo symbiotic interactions with soil bacteria, e.g. legume plants and nitrogen-fixing rhizobia, which form root nodules. If you sampled from the wild, you have possibly discovered the plant and its symbiont(s) in between root cells; otherwise it is just dirt. This is just speculation and depends on how you got the sample.

Anyway, it may then be best to sort the individual reads at the domain level (Bacteria vs. Eukaryota) and assemble afterwards, at least if you are mainly interested in a pure assembly.

ADD REPLY • link 15.4 years ago by Michael 56k

re reassembly: indeed, this is what we will do in the end. But since we already have a draft assembly, which reduces the total sequence length, I went for a contig screen first. As a bonus, one can confirm that the 454 contigs contain our plant DNA by mapping (BLAST pre-screened) ESTs, GSS sequences and Illumina short reads from bacteria-free (so far...) leaves / mRNAs.

ADD REPLY • link 15.4 years ago by Darked89 4.7k


15.3 years ago

Anar ▴ 40

Instead of submitting your IDs to Batch Entrez, you could simply extract tax info from the taxonomy flat file available from NCBI:

ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz = protein taxonomy info
ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz = nucleotide taxonomy info

Saves mucking around with XML format. Also avoids making tons of calls to EUtils over the network.
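A minimal Python sketch of that lookup is below. It assumes each line of the dump holds two whitespace-separated integers, the GI and its taxid, and that the GI is the second `|`-separated field of the BLAST subject id. The function names and the numbers in the example are made up for illustration.

```python
import gzip


def gi_from_subject(subject_id):
    """Extract the GI number from a subject id like gi|119525916|gb|CP000508.1|."""
    parts = subject_id.split("|")
    if len(parts) >= 2 and parts[0] == "gi":
        return int(parts[1])
    raise ValueError("no GI in subject id: " + subject_id)


def load_gi_taxids(lines, wanted_gis):
    """Scan 'gi<TAB>taxid' lines, keeping only the GIs we care about."""
    wanted = set(wanted_gis)
    found = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed or empty lines
        gi, taxid = int(parts[0]), int(parts[1])
        if gi in wanted:
            found[gi] = taxid
            if len(found) == len(wanted):
                break  # everything found, no need to read further
    return found


def load_from_gz(path, wanted_gis):
    """Same as load_gi_taxids, reading straight from the gzipped dump."""
    with gzip.open(path, "rt") as handle:
        return load_gi_taxids(handle, wanted_gis)


if __name__ == "__main__":
    # Made-up numbers, only to show the shape of the data.
    fake_dump = ["1\t10\n", "2\t20\n", "3\t30\n"]
    print(load_gi_taxids(fake_dump, [2, 3]))
```

Streaming the file and keeping only the few thousand GIs of interest avoids loading the whole dump into memory.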

This might be useful too: http://www.bioperl.org/wiki/Module:Bio::DB::Taxonomy

ADD COMMENT • link updated 21 months ago by Ram 45k • written 15.3 years ago by Anar ▴ 40