When assembling a genome from short-read Illumina (or Illumina-like) sequencing data, is there an accepted size (in bp) below which a contig should be removed from the final assembly? I have seen several papers use 300 bp as a threshold. If a contig of 300 bp or smaller has good coverage, I can't see any benefit to removing it.
I am specifically working with fungal and bacterial assemblies, though I assume the logic can be applied across the board.
Would you ever do that on the post-filtering input read length, or just the initial one?
I have contigs (if you can call them that) of 72 bp in some of my assemblies and did not do any filtering by size before moving on to the downstream analysis. I am now wondering whether I should repeat it after applying this filtering step.
I would apply that to the (near-)final assembly result, so not at the beginning or in the read-filtering steps.
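To make the filtering step concrete, here is a minimal sketch of a length filter run on a finished assembly. It assumes the contigs are in plain FASTA format; the function names `read_fasta` and `filter_by_length` are made up for illustration, and the 300 bp default is only the cutoff mentioned in the question, not a recommended value.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=300):
    """Keep contigs of at least min_len bp (assumed cutoff, adjust as needed)."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Toy assembly with one 400 bp contig and one 72 bp contig.
example = [
    ">contig_1",
    "ACGT" * 100,
    ">contig_2",
    "ACGT" * 18,
]
kept = filter_by_length(read_fasta(example), min_len=300)
print([(h, len(s)) for h, s in kept])
```

For a real assembly you would pass an open file handle instead of the `example` list. Whether a contig of exactly the cutoff length is kept or dropped is a choice; this sketch keeps it.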
Yeah, that's my frustration as well, and while I do understand how this happens, I can't stand it that my "assembly" result is smaller than my input. Then again, I date back to a different era in assembly, so this might not be as frustrating for the newer kids on the block :)
Ah, sorry, I meant using the smallest post-filtering read length as the basis for your twice-the-read-length threshold.
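If you did go with that reading, the arithmetic would look something like the sketch below. The helper name `min_contig_length` and the read lengths are made up for illustration; the factor of 2 is the "twice read length" idea from this thread, not an established standard.

```python
def min_contig_length(read_lengths, factor=2):
    """Contig length cutoff as a multiple of the shortest read (assumed rule)."""
    return factor * min(read_lengths)


# Hypothetical read lengths remaining after quality/adapter filtering.
post_filter_read_lengths = [72, 101, 150, 150]
threshold = min_contig_length(post_filter_read_lengths)
print(threshold)
```

With a shortest surviving read of 72 bp this gives a cutoff of 144 bp, so using the post-filtering minimum rather than the nominal read length can lower the threshold quite a bit.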
My main worry about not doing it is how much it could affect the quality of the downstream data analysis. Secondarily, how likely is it that these smaller contigs are contamination? I could (and probably should) have run a BLAST analysis for this. Instead I just used BUSCO and trusted its contamination scores.
Valid worry indeed ;)
The effect on the downstream analysis will depend on the type of analysis. E.g. gene annotation will not be affected much (it is very unlikely that you will have nice complete genes on those contigs, so removing them will not make you lose genes), whereas small RNA or repeat analysis might be affected ...
Another filter we tend to apply is on %GC (especially when looking for contamination): the %GC for eukaryotes is often quite different from that of potential bacterial or fungal contamination ...
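A rough sketch of what such a %GC check could look like is below. The outlier rule (flag contigs whose %GC differs from the assembly median by more than a fixed tolerance) and the 15-point tolerance are assumptions made for illustration, not the filter we actually run; the names `gc_percent` and `flag_gc_outliers` are hypothetical too.

```python
from statistics import median


def gc_percent(seq):
    """Percent G+C among unambiguous bases (N and other symbols are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def flag_gc_outliers(contigs, tolerance=15.0):
    """Return names of contigs whose %GC is far from the median (assumed rule)."""
    gc = {name: gc_percent(seq) for name, seq in contigs.items()}
    centre = median(gc.values())
    return sorted(name for name, value in gc.items() if abs(value - centre) > tolerance)


# Toy contigs: two at 20% GC and one at 80% GC.
contigs = {
    "contig_1": "ATATGCATAT" * 50,
    "contig_2": "ATATGGATAT" * 50,
    "contig_3": "GCGCGCATGC" * 50,
}
print(flag_gc_outliers(contigs))
```

Very short contigs give noisy %GC values, so a flag like this is better treated as a prompt to look closer (e.g. with BLAST) than as a reason to delete a contig outright.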