I am testing and comparing different metagenomic classifiers, and I plan to generate in silico NGS runs to benchmark their accuracy. For this I use bacterial genome assemblies downloaded from NCBI. It is essential that I know exactly how many reads I generate from each species and that the total read count is fixed (e.g., 2,000,000 reads in total, with 400,000 each from species A, B, C, D, and E).

My issue is that several of the downloaded genomes consist of multiple scaffolds (e.g., my collection of 12 species contains 40 scaffolds), and I don't know how to treat them. Since many of the scaffolds are very short, I considered simply leaving them out and using only the longest scaffold for each species, but I don't want that to bias the classification accuracy. Similarly, merging the scaffolds into a single sequence can introduce artificial junctions that decrease accuracy. Lastly, I thought about dividing each species' read count among its scaffolds. For this I would also need to take the quality and length of each scaffold into consideration, which seems unnecessarily tedious, because some of my datasets contain hundreds of bacterial species.

What should I do?
EDIT: The quality scores are missing for some scaffolds, so I probably shouldn't care about that anyway.
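For concreteness, if quality is ignored, the length-weighted split per species would look roughly like the sketch below (Python, assuming Biopython is available; `assembly.fasta` is a placeholder for one species' assembly, and 400,000 is the per-species budget from my example). The resulting per-scaffold counts could then be passed to a read simulator run on each scaffold separately, e.g. as wgsim's `-N` argument.

```python
# Minimal sketch: split a fixed per-species read budget across the
# scaffolds of one assembly, proportional to scaffold length.
# Assumes Biopython; the file name and budget are placeholders.
from Bio import SeqIO

def reads_per_scaffold(fasta_path, total_reads):
    """Return {scaffold_id: read_count} summing exactly to total_reads."""
    lengths = {rec.id: len(rec.seq) for rec in SeqIO.parse(fasta_path, "fasta")}
    genome_size = sum(lengths.values())

    # Ideal (fractional) share for each scaffold, floored to an integer.
    ideal = {sid: total_reads * l / genome_size for sid, l in lengths.items()}
    counts = {sid: int(x) for sid, x in ideal.items()}

    # Largest-remainder rounding: hand the leftover reads to the
    # scaffolds with the biggest fractional parts so the sum is exact.
    leftover = total_reads - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda sid: ideal[sid] - counts[sid],
                          reverse=True)
    for sid in by_remainder[:leftover]:
        counts[sid] += 1
    return counts

if __name__ == "__main__":
    for scaffold, n in reads_per_scaffold("assembly.fasta", 400000).items():
        # Each count could be fed to a simulator invoked per scaffold.
        print(scaffold, n)
```

The largest-remainder step matters for my use case: naive rounding of the per-scaffold shares would make the species totals drift off the exact numbers I need.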