I have assembled a bacterial genome using SPAdes. Coverage is very heterogeneous across the contigs: some have more than 30-fold coverage, while others have less than 5-fold. There are also many contigs shorter than 500 bp. I'm aware that I should filter out short, low-coverage contigs (they are probably spurious sequencing products), but what would be reasonable criteria for filtering them? I have found scripts that remove contigs shorter than 500 bp with less than 2-fold coverage. I believe this setting is too permissive, since most of the contigs in my assembly deviate substantially from these values.
EDIT:
As we can see, most contigs are shorter than 1000 bp, and most of these have less than 5-fold coverage. My plan is to filter out contigs shorter than 1000 bp with less than 5-fold coverage. Afterwards, I am going to check the completeness of the filtered genome sequence using CheckM (compared with the unfiltered genome).
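A filter along these lines could be sketched as follows. This is only a minimal sketch under stated assumptions: it assumes the contigs keep the default SPAdes header format (`NODE_<n>_length_<bp>_cov_<value>`), it reads the plan as "keep a contig only if it passes both cutoffs", and the function names `read_fasta` and `filter_contigs` are made up for illustration. Note that the `cov` value in SPAdes headers is k-mer coverage, which is lower than read depth.

```python
import re

# SPAdes names contigs like "NODE_1_length_12345_cov_23.456".
# The cov value is k-mer coverage, not read depth.
HEADER_RE = re.compile(r"length_(\d+)_cov_([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=1000, min_cov=5.0):
    """Keep contigs whose header reports length >= min_length and cov >= min_cov."""
    kept = []
    for header, seq in read_fasta(lines):
        match = HEADER_RE.search(header)
        if match is None:
            continue  # header is not in SPAdes format; skipped in this sketch
        length, cov = int(match.group(1)), float(match.group(2))
        if length >= min_length and cov >= min_cov:
            kept.append((header, seq))
    return kept
```

It could be called as `filter_contigs(open("contigs.fasta"))`, and the cutoffs can be changed through `min_length` and `min_cov` to compare CheckM results for different settings.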
EDIT 2:
I think this is a very good question. The answer may depend on what you are looking for. First of all, is a de novo assembly absolutely necessary, or is a reference-guided assembly possible? What is the expected genome length? How many reads do you have? Are they paired or unpaired, and how long are they? For example, if you sequenced to 100x of the expected length, you can probably ignore contigs with coverage below 10x, but if you sequenced to 30x, you may want to think twice.
Yes, a de novo assembly is necessary, because we have sequenced a novel bacterial species. The expected length is 5-6 Mbp. We have about 840,000 read pairs of ~250 bp (a raw coverage of about 80x: (2 x 250 x 840,000) / 5,000,000 = 84). After genome assembly I have 13 contigs with more than 20-fold coverage; the rest are shorter than 1 kb and have less than 5-fold coverage. In this case it seems pretty clear that I should remove the latter contigs. However, I would like to have a rationale for filtering, because we have other genomes with only 10-fold coverage. In that case it is less intuitive to decide on the proper threshold for contig filtering. Maybe I should plot the distribution of coverage and length in order to decide the best cutoffs (length and coverage).
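To make the arithmetic explicit, and as a rough stand-in for the plot, here is a small sketch. The function names `raw_coverage` and `tabulate` and the bin edges are assumptions chosen for illustration, not recommended cutoffs. It computes the expected raw coverage for both ends of the 5-6 Mbp range and counts contigs per (length bin, coverage bin).

```python
from collections import Counter


def raw_coverage(n_pairs, read_length, genome_size):
    """Expected fold coverage: total sequenced bases / genome size."""
    return 2 * n_pairs * read_length / genome_size


def tabulate(contigs, length_edges=(500, 1000, 10000), cov_edges=(2, 5, 10, 20)):
    """Count contigs per (length bin, coverage bin).

    `contigs` is an iterable of (length, coverage) pairs. A bin index i means
    the value is greater than or equal to i of the sorted edges.
    """
    def bin_index(value, edges):
        return sum(value >= edge for edge in edges)

    return Counter(
        (bin_index(length, length_edges), bin_index(cov, cov_edges))
        for length, cov in contigs
    )


print(raw_coverage(840_000, 250, 5_000_000))  # 84.0
print(raw_coverage(840_000, 250, 6_000_000))  # 70.0
```

If the table shows two well-separated groups (long, high-coverage contigs and short, low-coverage ones), any cutoff in the gap between them gives the same result. For a 10-fold genome the groups may overlap, which is where looking at the full distribution helps more than a fixed threshold.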
Hello, how do you calculate the coverage of each contig? Thanks.
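The thread does not say how the per-contig coverage was obtained. SPAdes writes a k-mer coverage value into each contig header; another common option is to map the reads back to the contigs and average the per-base depth. The sketch below assumes the second route and the three-column, tab-separated output of `samtools depth` (contig, position, depth). The function name `mean_depth` and the `contig_lengths` dictionary are hypothetical.

```python
from collections import defaultdict


def mean_depth(depth_lines, contig_lengths):
    """Mean read depth per contig from `samtools depth` rows (contig, pos, depth)."""
    totals = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue
        contig, _pos, depth = line.rstrip("\n").split("\t")
        totals[contig] += int(depth)
    # Divide by the full contig length so that positions missing from the
    # file (zero depth, omitted unless -a is used) still count.
    return {name: totals[name] / length for name, length in contig_lengths.items()}
```

Dividing by the contig length rather than by the number of rows matters for low-coverage contigs, where many positions may have no reads at all.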