Question

Troubleshooting assembly contigs of large genome

6.2 years ago

arunprasanna83 ▴ 60

Hello,

Here is my strange situation: I assembled the genome of the same sample (haploid source) with two different methods. The assembly sizes for the two methods are as follows:

Method1 = 900 Mb (5400 contigs >10kb)

Method2 = 500 Mb (7000 contigs >10kb)

I suspected duplication in Method 1 and checked completeness with BUSCO. Surprisingly, both methods gave similar completeness values, with no sign of diploidy (duplication) in Method 1. So I am very curious to know where the extra 400 Mb comes from. To find out, I am trying to align the two assemblies and visualize the alignments, but because the files are so large, almost every tool I have tried fails. For instance, I tried:

minidot: installation fails, even after repeated attempts
LASTZ alignment -> MAF -> AliTV: fails at the alignment step itself
MUMmer/nucmer: reports that the given length exceeds the allowed limit (I am using the 64-bit version and it still fails)
LAST: generates a MAF file of around 300 GB, which no downstream application can read
Gepard: hangs

I feel like I have hit a dead end. Kindly let me know how to handle this situation. I am very curious to know where these extra sequences come from!

Thanks in advance.

genome Assembly sequencing alignment • 1.5k views

ADD COMMENT • link 6.2 years ago by arunprasanna83 ▴ 60

Your hunch is probably not far off; it is likely due to redundancy in Method 1.

BUSCO might not show this because it only looks at the genic part of the assembly. The redundancy might very well be in gene-poor (or even gene-less) regions.

Which version of MUMmer are you trying to run?

Why not give good old BLAST a try? If it's simply to get a first idea, you can get that from blastn output as well.

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

I am using MUMmer 3.2.3. As @gconcepcion mentioned, I will try version 4. How does blastn help?

ADD REPLY • link 6.2 years ago by arunprasanna83 ▴ 60

Well, you could quickly "align" the sequences to each other (BLAST set 1 against set 2, and/or vice versa) and see whether a single query gets multiple (2?) hits in the other set.
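To make the idea above concrete, here is a minimal Python sketch that takes blastn results in tabular form (`-outfmt 6`, where the first four columns are query ID, subject ID, percent identity and alignment length) and lists the queries that hit two or more different contigs in the other assembly. The function names, the contig names in the example, and the thresholds (95% identity, 5 kb alignment length) are assumptions for illustration only and would need tuning for real data.

```python
from collections import defaultdict


def subjects_per_query(blast_lines, min_identity=95.0, min_aln_len=5000):
    """Map each query contig to the set of subject contigs it hits.

    Expects BLAST tabular lines (-outfmt 6): the first four tab-separated
    columns are qseqid, sseqid, pident and alignment length.
    The thresholds are illustrative assumptions, not recommendations.
    """
    hits = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, aln_len = float(fields[2]), int(fields[3])
        if pident >= min_identity and aln_len >= min_aln_len:
            hits[qseqid].add(sseqid)
    return dict(hits)


def multi_hit_queries(hits, min_subjects=2):
    """Keep only the queries that hit at least `min_subjects` contigs."""
    return {q: sorted(s) for q, s in hits.items() if len(s) >= min_subjects}


# Hypothetical example: Method 2 contigs as queries, Method 1 contigs as subjects.
example = [
    "m2_ctg1\tm1_ctgA\t99.1\t12000\t100\t8\t1\t12000\t1\t12000\t0.0\t21000",
    "m2_ctg1\tm1_ctgB\t98.7\t11800\t140\t12\t1\t11800\t500\t12300\t0.0\t20500",
    "m2_ctg2\tm1_ctgC\t99.5\t9000\t40\t5\t1\t9000\t1\t9000\t0.0\t16000",
    "m2_ctg2\tm1_ctgD\t88.0\t300\t30\t6\t1\t300\t1\t300\t1e-50\t200",
]
print(multi_hit_queries(subjects_per_query(example)))
```

A query listed here matches the same stretch of sequence in more than one contig of the other assembly, which is what redundancy in Method 1 would look like. Repeats will also produce multiple hits, so the identity and length filters matter.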

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

If you want to use MUMmer, make sure you are using version 4.0.0: https://github.com/mummer4/mummer/releases

D-GENIES is another dotplot option that works well for large genomes: http://dgenies.toulouse.inra.fr/
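If a dotplot is hard to produce or to read at this scale, a text summary of the same alignments can be a useful first pass. The sketch below is a rough illustration in Python, assuming the two assemblies have already been aligned with a tool that writes PAF (minimap2, for example). It reads the standard PAF columns (query name, query length, query start, query end) and reports what fraction of each query contig is covered by alignments. The function name, the example contig names and the 1 kb minimum alignment length are assumptions for illustration.

```python
from collections import defaultdict


def covered_fraction(paf_lines, min_aln_len=1000):
    """Return {query contig: fraction of its length covered by alignments}.

    Uses the first four PAF columns: query name, query length,
    query start and query end (0-based, end-exclusive).
    Alignments shorter than `min_aln_len` on the query are ignored;
    the threshold is an illustrative assumption.
    """
    intervals = defaultdict(list)
    lengths = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        qname = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        lengths[qname] = qlen
        if qend - qstart >= min_aln_len:
            intervals[qname].append((qstart, qend))

    result = {}
    for qname, qlen in lengths.items():
        covered = 0
        cur_start = cur_end = None
        # Merge overlapping intervals so shared bases are counted once.
        for start, end in sorted(intervals.get(qname, [])):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        if cur_end is not None:
            covered += cur_end - cur_start
        result[qname] = covered / qlen
    return result


# Hypothetical PAF records: Method 1 contigs aligned against Method 2.
example_paf = [
    "ctgA\t10000\t0\t4000\t+\ttgt1\t50000\t100\t4100\t3900\t4000\t60",
    "ctgA\t10000\t3000\t6000\t+\ttgt2\t40000\t0\t3000\t2900\t3000\t60",
    "ctgB\t2000\t0\t500\t-\ttgt1\t50000\t9000\t9500\t480\t500\t30",
]
for name, frac in sorted(covered_fraction(example_paf).items()):
    print(f"{name}\t{frac:.2f}")
```

Method 1 contigs with low coverage here, or contigs that are missing from the PAF entirely because they had no alignment, are candidates for the extra sequence.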

Are these assemblies from long reads? What assemblers were used? FALCON & Canu?

ADD REPLY • link 6.2 years ago by gconcepcion ▴ 410

I used MUMmer 3.2.3; I will give version 4 a try. By the way, the D-GENIES web version failed, and a local installation is not user-friendly. The assemblies are from long reads and were assembled with Canu.

ADD REPLY • link 6.2 years ago by arunprasanna83 ▴ 60