Segregating Metagenomic Sequences Into Prokaryotic And Eukaryotic Groups
13.9 years ago
Monzoor ▴ 300

Say I have a million 454 reads generated from microbes residing in the gut of a previously unknown (unsequenced) insect. There is a high likelihood that sequences originating from insect DNA have contaminated the sequenced data set. How do I identify and remove these sequences? Please don't suggest BLAST; I am looking for approaches that do not need high compute power.


It's possible that there is contamination, but it's hardly relevant because these reads might not align to a microbial reference database anyway. I don't understand why you cannot BLAST, because you will have to do that anyway. By the way, a computer that can carry out a BLAST of so few 454 reads is not that expensive. If you cannot BLAST your reads, you simply don't have the compute power your analysis requires, and the best recommendation is to acquire it first.


Hey, this is akin to saying "BLAST and MEGAN are the last and only resort". I was expecting this question to spawn a few novel ideas that help researchers in resource-poor settings. In other words, metagenomic analysis currently seems restricted to groups that can afford huge compute resources. I guess people should be developing alignment-free approaches that do not need an all-vs-all BLAST. This should be possible in some manner.

13.9 years ago
User 59 13k

It's kind of hard to suggest a solution without using BLAST. You have an unsequenced insect, and therefore you're interested in how much contamination from this organism has bled through into the gut sample. Because your organism is unsequenced, I don't think an approach of GC content bias will work particularly well.
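For reference, here is a minimal sketch of what a GC-content screen would involve. The function names are made up for illustration, and nothing here says where to put a cutoff: with an unsequenced host there is no known insect GC distribution to compare against, which is the weakness mentioned above.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among the unambiguous A/C/G/T bases of a read."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Reads made up entirely of ambiguous bases (e.g. N) get 0.0.
    return gc / total if total else 0.0


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_per_read(lines):
    """Map each read header to its GC fraction."""
    return {name: gc_fraction(seq) for name, seq in read_fasta(lines)}
```

Plotting a histogram of these values is cheap even for a million reads; whether it shows two separable populations depends entirely on the organisms involved.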

I have seen suggestions elsewhere that you could assemble the data first to reduce the amount of work that follows. BLAST combined with MEGAN is often used to detect contaminants. The fewer sequences you have to deal with in terms of the BLAST step, the happier you will be.

Perhaps the first step would be to identify the amount of likely contamination. Rather than BLASTing a million reads, BLAST 10,000 reads and work out what your likely contamination percentage is.
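As a rough sketch of that subsampling step (the function names and the fixed seed are my own choices, not part of any tool): draw a uniform random subset of reads in a single pass, BLAST only those, then put a confidence interval around the observed contamination fraction.

```python
import math
import random


def reservoir_sample(items, k, seed=0):
    """Uniform random sample of k items from an iterable, in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for the proportion hits/n (z=1.96 ~ 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, if 1,200 of the 10,000 sampled reads turned out to be eukaryotic, `wilson_interval(1200, 10000)` gives an interval of roughly 11.4% to 12.7%, which is already tight enough to decide whether the contamination is worth worrying about.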

Having had to recently BLAST 5x10^6+ reads in a metagenomic sample, I understand your reluctance to do this. But at some point you're going to have to do this for gene identification, so the question is how much of a problem is the contamination in the first place.


I second the use of MEGAN; the contaminating reads should be classified as eukaryotic sequences.


Assuming I find 10-15% contamination with a random sampling, I still need a solution for quantifying/separating these sequences from the entire data set.

I am planning to use MG-RAST for functional annotation instead of a gene prediction method. Maybe I need to use something like MetaGene for gene prediction.


Our researchers use MG-RAST as well. Won't MG-RAST do the phylogenetic annotation as well in the pipeline? Can you then dissect out your insect contaminants that way?


MG-RAST bases its phylogenetic inferences on BLASTing sequences against 16S/18S/28S etc. reference sequences and finding the relative abundance of various taxa. What it does not give is a read-by-read taxonomic assignment. In any case, new insect sequences are absent from its database in any form, so there is no question of my sequences hitting the novel insect genome. Correct me if my understanding is wrong.


Nope, I'm not particularly familiar with MG-RAST. I knew it had a BLAST step; I didn't know it was just for the rRNAs. I accept that it is not appropriate in this context!

13.9 years ago
User 59 13k

You could also have a look at NBC, the naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads.

Paper here, web interface here.
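To give a feel for the general idea, here is a toy multinomial naive Bayes classifier over k-mer counts. This is only an illustration of the technique, not NBC's actual implementation: the class name, the default k=3, the add-one smoothing, and the uniform prior are all assumptions I made for the sketch.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Yield overlapping k-mers that contain only A/C/G/T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            yield word


class KmerNaiveBayes:
    """Multinomial naive Bayes over k-mer counts with add-one smoothing."""

    def __init__(self, k=3):
        self.k = k
        self.vocab_size = 4 ** k
        self.counts = {}  # label -> Counter of k-mer counts
        self.totals = {}  # label -> total number of k-mers seen

    def train(self, label, seq):
        """Add the k-mers of one reference sequence to a label's counts."""
        c = self.counts.setdefault(label, Counter())
        c.update(kmers(seq, self.k))
        self.totals[label] = sum(c.values())

    def log_likelihood(self, label, seq):
        """Sum of smoothed log-probabilities of the read's k-mers."""
        c, total = self.counts[label], self.totals[label]
        return sum(
            math.log((c[w] + 1) / (total + self.vocab_size))
            for w in kmers(seq, self.k)
        )

    def classify(self, seq):
        """Return the label with the highest likelihood (uniform prior)."""
        return max(self.counts, key=lambda lab: self.log_likelihood(lab, seq))
```

Training is a single pass over the reference genomes and classification is a handful of table lookups per read, so the compute cost is far below an alignment against NT/NR. The catch for this question is the same as with BLAST: the classifier can only recognise taxa it has been trained on.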

13.9 years ago
Heikki ▴ 360

Sounds to me like you are looking for purely statistical methods for separating taxonomically different genomes in metagenomic data sets:

Andrey Kislyuk, Srijak Bhatnagar, Jonathan Dushoff, Joshua S Weitz: Unsupervised statistical clustering of environmental shotgun sequences BMC Bioinformatics 2009, 10:316 doi:10.1186/1471-2105-10-316

Fengfeng Zhou and Ying Xu: cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data Bioinformatics 26 (16) 2051–2052, 2010 doi:10.1093/bioinformatics/btq299

... and the latest Bioinformatics has:

Monzoorul Haque Mohammed, Tarini Shankar Ghosh, Nitin Kumar Singh, Sharmila S. Mande: SPHINX—an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics (2011) 27 (1): 22-30. doi: 10.1093/bioinformatics/btq608

I have not tried these myself but I have a feeling that on real world data their performance is far from ideal.


I have tried cBar and found its performance to be good with sequences that are at least 2000 bp in length. The accuracy with NGS reads (length 100-400 bp) is around 50%, which in my opinion could be obtained by tossing a coin.

Being an author of SPHINX, I can only ask you to try the same. FYI, SPHINX has been tried on two real metagenomic data sets (salterns and lean/obese mouse) and results were found to be in line with earlier published reports. I would be glad to hear about your feedback on SPHINX and Sort-ITEMS.

However, neither cBar nor SPHINX is meant for the question I posed here.

13.9 years ago
Michael 55k

Monzoor, as you are interested in metagenomics approaches other than a brute-force BLAST against NT/NR, here are some more refined approaches. They might not be less computationally intensive, but they are at least different:

The CARMA pipeline uses Pfam hits for phylogenetic classification. WebCARMA is the web application server to it.

TACOA uses a kernel nearest-neighbor classifier on sequence features. Also, in the background section of that article, the authors mention quite a number of competing approaches (including the 'naive' BLAST approach).
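As a rough illustration of nearest-neighbor classification on sequence features, here is a plain (non-kernelized) k-NN sketch using normalised k-mer frequency vectors and Euclidean distance. It is a simplification and not TACOA's method; the function names, the choice of features, and the default parameters are assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Normalised frequency vector over all 4**k k-mers (A/C/G/T only)."""
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def knn_classify(query, references, n_neighbors=3, k=2):
    """Majority vote among the n_neighbors closest reference profiles.

    references: list of (label, sequence) pairs of known origin.
    """
    q = kmer_profile(query, k)
    scored = sorted(
        (math.dist(q, kmer_profile(seq, k)), label)
        for label, seq in references
    )
    votes = Counter(label for _, label in scored[:n_neighbors])
    return votes.most_common(1)[0][0]
```

Short reads give noisy frequency vectors, so in practice the accuracy of this kind of composition-based method drops quickly as fragment length decreases.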

Hope this helps
