Background: I am working on a project where we have samples from many novel species of an invertebrate (no idea of genome size) on which we want to perform proteomic analysis. Shotgun proteomics uses tandem mass spectrometry data to search against a database of known proteins; typically this database comes from the CDS annotations of a genome assembly (though additional experimental evidence helps the assignments). For this project the plan is to use RNA-seq followed by de novo assembly to construct an exon database for each species. We have done this once and it didn't work well, but we are fine-tuning our assembly parameters based on downstream performance.
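(For illustration, the sort of database-building step I mean is sketched below: a naive six-frame translation of the assembled transcripts into a protein FASTA for the MS/MS search engine. This is only a sketch assuming Biopython is installed; the file names are placeholders, and a dedicated ORF caller would probably give a cleaner search space. It mainly shows where assembly quality feeds into the search space size.)

```python
# Minimal sketch: six-frame translate de novo transcripts into a protein FASTA
# to use as the MS/MS search database. File names are placeholders; a proper
# ORF caller would likely produce a smaller, cleaner search space.
from Bio import SeqIO

MIN_LEN = 30  # drop very short translated fragments (amino acids)

with open("search_db.fasta", "w") as out:
    for rec in SeqIO.parse("transcripts.fasta", "fasta"):
        seq = rec.seq.upper()
        for strand, s in (("+", seq), ("-", seq.reverse_complement())):
            for frame in range(3):
                # trim to a multiple of 3 before translating this frame
                sub = s[frame:len(s) - ((len(s) - frame) % 3)]
                prot = sub.translate()
                # split on stop codons and keep reasonably long peptides
                for i, frag in enumerate(str(prot).split("*")):
                    if len(frag) >= MIN_LEN:
                        out.write(f">{rec.id}_{strand}{frame}_{i}\n{frag}\n")
```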
Question: How many clusters of 100 bp paired-end reads should we sequence to generate the de novo assembly (25, 50, 75 or 100M)? Given the number of species we want to process, this step needs to be cost-efficient, but once downstream processing time is factored in (assembly, then proteomic analysis and its computational cost), 100M may actually be the better choice if assembly quality suffers too much at 25M. I would say the biggest downstream problem is a really fragmented assembly, since that makes the proteomic search space prohibitively large.
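One way I could imagine quantifying that trade-off empirically is to subsample the reads, assemble each subset, and compare basic contiguity stats. A minimal sketch is below (assuming Biopython, with hypothetical file names like assembly_25M.fasta for assemblies already built from each subsampled read set):

```python
# Sketch: compare contiguity of assemblies built from subsampled read sets
# (e.g. 25M vs 100M pairs). File names below are hypothetical placeholders.
from Bio import SeqIO

def assembly_stats(fasta):
    lengths = sorted((len(r.seq) for r in SeqIO.parse(fasta, "fasta")), reverse=True)
    total = sum(lengths)
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running >= total / 2:
            n50 = length
            break
    return {"contigs": len(lengths), "total_bp": total, "N50": n50}

for depth in ("25M", "50M", "75M", "100M"):
    print(depth, assembly_stats(f"assembly_{depth}.fasta"))
```

If the contig count blows up and N50 collapses at 25M relative to 100M, that would argue for the deeper sequencing despite the extra cost.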
Any thoughts or suggestions are appreciated. I have no feel for what quality differences to expect between 25M and 100M reads. Thanks.