It is a simple task to predict the ultimate richness (diversity) of a metagenomic sample (for example, by using the Chao1 estimator to get the total number of different species/OTUs that could be present in the sample). It is also very easy to calculate the relative abundances of species identified in a metagenomic dataset - just divide the number of sequences assigned to a specific species by the total number of sequences.
I was wondering if there is a way to predict relative abundances of unidentified species (e.g., predict the relative abundance of the least abundant species of all species present in the sample according to Chao1 estimator).
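To make the two "easy" calculations concrete, here is a minimal Python sketch. It assumes the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs; the OTU names and counts are made up for illustration.

```python
def relative_abundances(counts):
    """Relative abundance of each OTU: its read count divided by the total read count."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}


def chao1(counts):
    """Bias-corrected Chao1 richness estimate: S_obs + F1*(F1-1) / (2*(F2+1))."""
    observed = [n for n in counts.values() if n > 0]
    s_obs = len(observed)                      # number of observed OTUs
    f1 = sum(1 for n in observed if n == 1)    # singletons
    f2 = sum(1 for n in observed if n == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical OTU table (100 reads in total), for illustration only.
otu_counts = {"otu_a": 60, "otu_b": 30, "otu_c": 6, "otu_d": 2, "otu_e": 1, "otu_f": 1}

abundances = relative_abundances(otu_counts)
richness = chao1(otu_counts)
print(abundances["otu_a"], richness)
```

With these made-up counts there are 6 observed OTUs, 2 singletons and 1 doubleton, so the estimate is 6 + 2*1 / (2*2) = 6.5.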
Hi Mikhail, thank you very much for your helpful suggestion and the papers.
I agree that Chao1 and similar estimators (ACE) tend to underestimate true species richness, especially when sequencing depth is low; nevertheless, these estimators are widely accepted and used in metagenomic studies (Huber 2007).
The Good-Turing estimator is almost what I wanted, but it gives only the total relative abundance of all unobserved species in the community. Is it possible to estimate the individual relative abundances of unobserved species? For example, suppose that 1000 sequences from a metagenomic sample yield 6 different OTUs and Chao1 suggests 10 species/OTUs in total, so 4 of the 10 species remain unobserved. Is there a way to estimate the relative abundance of each of them (the 7th, 8th, 9th and 10th species)?
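To illustrate what the Good-Turing estimate does and does not give, here is a small Python sketch. It assumes the basic Good-Turing form, in which the total relative abundance of unobserved species is F1 / N (singletons divided by total sequences). Dividing that mass evenly among the unobserved species is not an estimate of each one; it only gives an average, which is an upper bound for the least abundant of them. The counts are made up, and the total of 10 OTUs is simply taken from the example above rather than computed from these counts.

```python
def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total relative abundance of unobserved species: F1 / N."""
    n_sequences = sum(counts)
    f1 = sum(1 for c in counts if c == 1)  # number of singleton OTUs
    return f1 / n_sequences


def mean_unobserved_abundance(counts, s_total):
    """Average abundance per unobserved species, given an estimated total richness.

    The least abundant unobserved species cannot exceed this average,
    so it serves as an upper bound, not as an estimate.
    """
    s_obs = sum(1 for c in counts if c > 0)
    n_unobserved = s_total - s_obs
    if n_unobserved <= 0:
        raise ValueError("estimated richness must exceed observed richness")
    return good_turing_unseen_mass(counts) / n_unobserved


# Hypothetical counts: 1000 sequences spread over 6 observed OTUs.
example_counts = [700, 200, 92, 5, 2, 1]
unseen_mass = good_turing_unseen_mass(example_counts)
upper_bound = mean_unobserved_abundance(example_counts, s_total=10)
print(unseen_mass, upper_bound)
```

Here one singleton in 1000 sequences gives an unseen mass of about 0.001, and spreading it over 4 unobserved species gives an average of about 0.00025 each.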
At the moment I have some preliminary data about my sample obtained using 16S pyrosequencing; the diversity is quite low (I found 180 different OTUs and according to Chao1 there should be about 200 OTUs in total) and these results are in good agreement with literature values. The next step will be shotgun Illumina sequencing of my total metagenomic DNA; I don't want to spend more money on sequencing than is really needed, so I am trying to assess the sequencing depth required to get at least 9x coverage of the genome of the least abundant species in my sample, using Lander-Waterman calculations:
C = (L × N × a) / G

where
C is the coverage,
G is the haploid genome length,
L is the read length,
N is the number of reads,
a is the relative abundance of the least abundant species.
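As a rough sketch of how this calculation could be done in Python: it assumes the relation has the usual Lander-Waterman form scaled by abundance, C = L × N × a / G, and solves it for N. It also treats a as the fraction of reads coming from the target genome, which ignores differences in genome size between species. The genome length, read length and abundance in the example are hypothetical.

```python
import math


def reads_required(coverage, genome_length, read_length, rel_abundance):
    """Solve C = L * N * a / G for N, rounded up to a whole number of reads."""
    if not 0 < rel_abundance <= 1:
        raise ValueError("relative abundance must be in (0, 1]")
    if read_length <= 0 or genome_length <= 0:
        raise ValueError("lengths must be positive")
    return math.ceil(coverage * genome_length / (read_length * rel_abundance))


# Hypothetical inputs: 9x coverage of a 5 Mb genome with 100 bp reads,
# for a species that makes up 0.1% of the community.
n_reads = reads_required(coverage=9, genome_length=5_000_000,
                         read_length=100, rel_abundance=0.001)
print(f"{n_reads:.3g} reads needed")
```

With these inputs the answer is on the order of 4.5 × 10^8 reads. Because N scales as 1/a, any uncertainty in the abundance of the rarest species translates directly into the required sequencing depth.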