Hello,
I have metagenomic sequencing data (Illumina NovaSeq 6000) from agricultural soil samples (1 000 000 to 10 000 000+ reads per sample), which I've taxonomically classified (after the usual processing steps, of course) using Kraken2 and the NCBI core nt database.
For downstream analyses, I want to rarefy these data using the rrarefy function of vegan. (I know rarefying can be tricky and results in loss of data, but I need it for the downstream analyses.)
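Roughly, the plan is something like this (just a sketch; taxa_table_bac and the depth of 100 000 are placeholders, with samples in rows as vegan expects):

library(vegan)

# Sketch of the planned rarefaction; `taxa_table_bac` stands in for the
# full samples-x-taxa count matrix, and 100000 is a placeholder depth.
set.seed(123)                                 # make the random subsampling reproducible
depth <- 100000
keep  <- rowSums(taxa_table_bac) >= depth     # samples shallower than the depth must be dropped
taxa_rarefied <- rrarefy(taxa_table_bac[keep, ], sample = depth)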
To check how many reads should be randomly sampled, I created a rarefaction plot of a subset of 10 samples using vegan::rarecurve(taxa_table_bac.k.10, step = 1000, label = FALSE). See the image below.
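To see how many samples a given cutoff would cost, I also looked at per-sample depths (sketch, on the same subset object):

depths <- rowSums(taxa_table_bac.k.10)   # total classified reads per sample
summary(depths)
sum(depths < 100000)                     # samples that would be dropped at a 100k depth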
As you can see, a saturation plateau is 'already' reached at around 100 000 reads. This is quite surprising to me: in the context of metagenomic vs. amplicon data, I thought it would take several million reads to reach saturation with metagenomics. This is of course sample-type dependent, but soil samples are generally complex and species-rich. With 16S amplicons I've seen such saturation around 30 000 to 40 000 reads, so not that far off.
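One way I tried to put a number on the plateau (a sketch; vegan's rareslope gives the slope of the rarefaction curve at a given depth, so values near zero mean hardly any new species per additional read):

rareslope(taxa_table_bac.k.10, sample = 100000)   # per-sample curve slope at 100k reads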
(Even though it is a subset of 10 samples out of 300 in total, I expect the pattern to be similar across all samples.)
Is it 'normal' to reach a saturation plateau this quickly with metagenomics? What is your experience with species saturation when using metagenomic reads?
The samples each have several thousand species detected. I expected a bit more, but it still seems like a fair amount.
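For reference, I checked observed richness per sample like this (specnumber simply counts taxa with nonzero counts):

specnumber(taxa_table_bac.k.10)   # observed number of taxa per sample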