Genome assembly, removal of contigs below 300 bp?
3.2 years ago

When assembling a genome with short-read Illumina (or Illumina-like) sequencing technologies, is there an accepted size in bp below which a contig should be removed from the final assembly? I have seen several papers use 300 bp as a threshold. If a contig of 300 bp or smaller has good coverage, I can't see any benefit to removing it.

I am specifically working with fungal and bacterial assemblies; however, I assume the logic applies across the board.

assembly • 1.7k views
3.2 years ago

I don't think there is a general consensus on this (bluntly said: people likely use whatever fits their 'story' best).

A rule I have always applied to my own assembly projects is to remove any contig shorter than roughly twice the read length of the input reads. With 150 nt PE input reads, my threshold would then be at ~300 nt.

Filtering on coverage, as you mention, is of course also an acceptable measure, but it is somewhat more work to implement (filtering on length is very easy and straightforward).
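For illustration, a minimal sketch of such a length filter, assuming Biopython is installed; the file names and the 300 bp cutoff are placeholders (set the cutoff from your own read length, e.g. ~2x 150 nt reads):

    from Bio import SeqIO

    MIN_LEN = 300  # placeholder: ~2x the input read length

    # keep only contigs at or above the length threshold
    kept = (rec for rec in SeqIO.parse("assembly.fasta", "fasta")
            if len(rec.seq) >= MIN_LEN)
    n = SeqIO.write(kept, "assembly.min300.fasta", "fasta")
    print(f"kept {n} contigs of >= {MIN_LEN} bp")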

In any case, all these tricks are questionable in the end, meaning there are exceptions and difficulties with every approach.


twice the read length of the input reads.

Would you ever base that on the post-filtering input read length, or just the initial one?

I have contigs (if you can call them that) of 72 bp in some of my assemblies, and I did not do any filtering by size before moving on to the downstream analysis. I am now wondering if I should repeat it with this filtering step included.


I would apply that to the (near) final assembly result, so not at the beginning or during the read-filtering steps.

Yeah, that's my frustration as well, and while I do understand how this comes about, I can't stand it when my "assembly" result is smaller than my input. Then again, I date back to a different era in assembly, so this might not be as frustrating for the newer kids on the block :)


Ah sorry, I meant using the smallest post-filtering read length as the basis for your twice-read-length threshold.

My main worry about not doing it is how much it could affect the quality of the downstream data analysis. Secondarily, how likely is it that these smaller contigs are contamination? I probably could/should have run a BLAST analysis for this; instead I just used BUSCO and trusted its contamination scores.


valid worry indeed ;)

And the effect on the downstream analysis will depend on the type of analysis. E.g. gene annotation will not be much affected (it is very unlikely you will have nice, complete genes on such small contigs, so removing them will not make you lose genes), while small-RNA or repeat analysis might be affected by it.

Another filter we tend to apply is on %GC (especially when looking for contamination); the %GC of a eukaryote will often be quite different from that of potential bacterial or fungal contamination.
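As a rough sketch of such a %GC check, again assuming Biopython (gc_fraction requires version 1.80 or newer); the 30-60% window is only an example and should be set from the expected %GC of your organism:

    from Bio import SeqIO
    from Bio.SeqUtils import gc_fraction  # Biopython >= 1.80; older versions provide GC()

    LOW, HIGH = 0.30, 0.60  # placeholder window for the expected %GC range

    # flag contigs whose %GC falls outside the expected window
    for rec in SeqIO.parse("assembly.min300.fasta", "fasta"):
        gc = gc_fraction(rec.seq)
        if not (LOW <= gc <= HIGH):
            print(f"{rec.id}\t{gc:.2f}\toutside expected %GC range - possible contaminant")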
