I am testing and comparing different metagenomic classifiers, and I plan to generate in silico NGS runs to benchmark their accuracy. For this I use bacterial genome assemblies downloaded from NCBI. It is essential that I know exactly how many reads I generate from each species and that the total read count is fixed (e.g., 2,000,000 reads in total, with 400,000 each from species A, B, C, D, and E).

My issue is that several of the downloaded genomes consist of multiple scaffolds (e.g., my collection of 12 species contains 40 scaffolds), and I don't know how to treat them. Since many of the scaffolds are very short, I considered simply leaving them out and using only the longest scaffold for each species, but I don't want that to bias the classification accuracy. Similarly, merging the scaffolds into a single sequence can introduce artificial junctions that decrease accuracy. Lastly, I thought about dividing each species' read count among its scaffolds. For this I would also need to take the quality and length of each scaffold into consideration, which seems unnecessarily tedious, because some of my datasets contain hundreds of bacterial species.

What should I do?
EDIT: The quality scores are missing for some scaffolds, so I probably shouldn't care about that anyway.
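For concreteness, if quality is ignored, the length-weighted split per species would look roughly like the sketch below (Python, assuming Biopython is available; `assembly.fasta` is a placeholder for one species' assembly, and 400,000 is the per-species budget from my example). The resulting per-scaffold counts could then be passed to a read simulator run on each scaffold separately, e.g. as wgsim's `-N` argument.

```python
# Minimal sketch: split a fixed per-species read budget across the
# scaffolds of one assembly, proportional to scaffold length.
# Assumes Biopython; the file name and budget are placeholders.
from Bio import SeqIO

def reads_per_scaffold(fasta_path, total_reads):
    """Return {scaffold_id: read_count} summing exactly to total_reads."""
    lengths = {rec.id: len(rec.seq) for rec in SeqIO.parse(fasta_path, "fasta")}
    genome_size = sum(lengths.values())

    # Ideal (fractional) share for each scaffold, floored to an integer.
    ideal = {sid: total_reads * l / genome_size for sid, l in lengths.items()}
    counts = {sid: int(x) for sid, x in ideal.items()}

    # Largest-remainder rounding: hand the leftover reads to the
    # scaffolds with the biggest fractional parts so the sum is exact.
    leftover = total_reads - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda sid: ideal[sid] - counts[sid],
                          reverse=True)
    for sid in by_remainder[:leftover]:
        counts[sid] += 1
    return counts

if __name__ == "__main__":
    for scaffold, n in reads_per_scaffold("assembly.fasta", 400000).items():
        # Each count could be fed to a simulator invoked per scaffold.
        print(scaffold, n)
```

The largest-remainder step matters for my use case: naive rounding of the per-scaffold shares would make the species totals drift off the exact numbers I need.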