Filtering contigs by length during assembly?
6.2 years ago
n,n ▴ 370

Hello everyone, I have a question that has been bothering me for a while, since I can't find any clear documentation about it (in the context of assembling bacterial Illumina paired-end reads).

As you know, many assemblers have an option to filter contigs out of the final assembly when they are below a chosen nucleotide length. This minimum contig length parameter is set by the user however she/he sees fit, so I was wondering whether there is any advice or guideline for choosing it in a non-arbitrary way. I've seen that some people don't like working with contigs shorter than 500 nucleotides because it's very likely that they don't contain relevant information or that they result from sample contamination; however, this doesn't seem to be a standard from what I can tell.

Any experiences/thoughts/references would be greatly appreciated. Thank you!

Assembly • 4.4k views

I tend to use 2x the read length as the threshold. Quite liberal, yes, but putting two reads next to each other sounds like the minimum assembly you can do.

But overall, I assume there is no commonly accepted lower bound.
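A minimal sketch of this rule of thumb in plain Python, assuming the contigs are available as FASTA text and the read length is 150. The names `parse_fasta` and `filter_by_length` are made up for illustration and do not belong to any assembler:

```python
from typing import Dict, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(text: str, read_length: int, factor: int = 2) -> Dict[str, str]:
    """Keep contigs that are at least factor * read_length long."""
    min_len = factor * read_length
    return {h: s for h, s in parse_fasta(text) if len(s) >= min_len}


# Toy input: one 400 nt contig and one 120 nt contig.
example = ">contig_1\n" + "A" * 400 + "\n>contig_2\n" + "C" * 120 + "\n"
kept = filter_by_length(example, read_length=150)  # threshold = 300
print(sorted(kept))
```

With a read length of 150 the threshold is 300, so only the 400 nt contig survives in this toy input.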


So you mean that if you have an average read length of 150 nucleotides (as is common for Illumina paired-end reads), you would use 300, right? This is something I hadn't thought of. Thank you for your advice.


For instance, yes.

But I also agree with others here that if you have to rely on including those very small contigs, your assembly is likely not of high quality. With a decent assembly you should be able to get to ~1000 bp contigs quite straightforwardly.


You have to base it on your question a little. Are you looking for some ‘killer SNP’ that’s super important so you want to find every possible base alteration? Maybe you want a lower threshold.

If you want the simplest/cleanest/most intact genome assembly, set it as high as you can before you feel like you're throwing away good data. A couple of kb would be more than enough, I imagine.
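One way to judge how much data a given cutoff throws away is to compute the fraction of assembled bases retained at several candidate thresholds. The sketch below assumes you already have a list of contig lengths; the function name `retained_fraction` and the example lengths are invented for illustration:

```python
from typing import Dict, Iterable


def retained_fraction(lengths: Iterable[int], thresholds: Iterable[int]) -> Dict[int, float]:
    """Map each threshold to the fraction of total bases in contigs >= threshold."""
    lengths = list(lengths)
    total = sum(lengths)
    result = {}
    for t in thresholds:
        kept = sum(n for n in lengths if n >= t)
        result[t] = kept / total if total else 0.0
    return result


# Made-up contig lengths, not from a real assembly.
contig_lengths = [250000, 120000, 40000, 1500, 800, 450, 300, 210]
report = retained_fraction(contig_lengths, [300, 500, 1000, 2000])
for threshold, fraction in report.items():
    print(f"min length {threshold}: {fraction:.4%} of bases retained")
```

If raising the threshold barely changes the retained fraction, the short contigs carry little of the assembly and the higher cutoff costs almost nothing.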


The "optimal" contig length to filter on is probably dependent on the dataset. Also, I believe using multiple criteria (such as k-mer / mapping coverage, taxonomic BLAST identity, GC content, and possibly others) is better than filtering by contig length alone.
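As a rough sketch of combining criteria, the snippet below checks length, GC content, and k-mer coverage together. It assumes SPAdes-style contig headers such as `NODE_1_length_1000_cov_12.5`, from which the coverage is parsed; the helper names and every threshold are placeholders to adapt to your dataset, not recommended values. Taxonomic BLAST identity is left out because it needs an external search:

```python
import re
from typing import NamedTuple, Optional, Tuple


class ContigStats(NamedTuple):
    length: int
    gc: float
    coverage: Optional[float]


# Assumes SPAdes-style headers, e.g. NODE_1_length_1000_cov_12.5
COV_PATTERN = re.compile(r"_cov_([0-9]+(?:\.[0-9]+)?)")


def contig_stats(header: str, sequence: str) -> ContigStats:
    """Compute length and GC fraction; parse coverage from the header if present."""
    seq = sequence.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    match = COV_PATTERN.search(header)
    coverage = float(match.group(1)) if match else None
    return ContigStats(len(seq), gc, coverage)


def passes(
    stats: ContigStats,
    min_length: int = 1000,            # placeholder threshold
    min_coverage: float = 5.0,         # placeholder threshold
    gc_range: Tuple[float, float] = (0.30, 0.70),  # placeholder range
) -> bool:
    """Return True only if the contig satisfies every criterion."""
    if stats.length < min_length:
        return False
    # Coverage is only checked when the header actually provides it.
    if stats.coverage is not None and stats.coverage < min_coverage:
        return False
    return gc_range[0] <= stats.gc <= gc_range[1]


stats = contig_stats("NODE_7_length_1200_cov_1.8", "ACGT" * 300)
print(stats, passes(stats))
```

Here the toy contig is long enough and has a balanced GC content, but it is rejected because its coverage falls below the placeholder minimum.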

Some other related threads:

Criteria for filtering contigs after spades assembly

Behavior of short read assemblers regarding filtering out short contigs


Thanks for the reply. I didn't really think about mixing it up with other parameters until now, will definitely look it up, cheers.


Some people will say approximately 1000 basepairs, since the average prokaryotic gene is about this length, so anything smaller is almost certainly garbage. I personally think even 1000 is probably generous.


I've experimented with 1000 and it feels pretty solid, but I didn't think I could back it up with the average prokaryotic gene length. I actually found a reference for this info - http://bioscience.jbpub.com/cells/MBIO137.aspx - thank you for this tip.

EDIT: I realized from reading other posts that another good strategy, if there are other contig-level assemblies of your strain in NCBI, is to do a brief analysis of the TSV assembly summary files from those assemblies and see how contig filtering was approached. In my case there were 10 assemblies and not a single contig below 1000 bp appears in them, meaning they probably used a 1000 bp filter as standard.
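A small sketch of that comparison, assuming the contig lengths of each public assembly have already been extracted (how to obtain them from the summary files is not shown here). The assembly names and lengths below are invented, and `smallest_contig_per_assembly` is a hypothetical helper:

```python
from typing import Dict, List


def smallest_contig_per_assembly(assemblies: Dict[str, List[int]]) -> Dict[str, int]:
    """Return the shortest contig length found in each non-empty assembly."""
    return {name: min(lengths) for name, lengths in assemblies.items() if lengths}


# Invented example data: assembly name -> contig lengths in bp.
public_assemblies = {
    "assembly_A": [310000, 95000, 4200, 1012],
    "assembly_B": [280000, 150000, 2300, 1003],
    "assembly_C": [500000, 8700, 1450],
}

floors = smallest_contig_per_assembly(public_assemblies)
print(floors)
print("smallest contig overall:", min(floors.values()))
```

If the smallest contigs across all assemblies sit just above a round number, that hints at the cutoff the submitters likely applied, though it does not prove it.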
