Troubleshooting assembly contigs of large genome
6.2 years ago

Hello,

Here is my strange situation: I assembled the genome of a single sample (haploid source) with two different methods. The assembly sizes from the two methods are:

Method1 = 900 Mb (5400 contigs >10kb)

Method2 = 500 Mb (7000 contigs >10kb)
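For reference, figures like the ones above can be recomputed directly from the assembly FASTA files. This is only a minimal Python sketch, assuming plain (uncompressed) FASTA input; the function names and the file name in the usage comment are illustrative and do not come from any tool mentioned in this thread.

```python
from typing import Dict, Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (contig_name, length) for each record in a FASTA stream."""
    name = None
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""  # keep only the ID, drop the description
            length = 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def summarize(lines: Iterable[str], min_len: int = 10_000) -> Dict[str, int]:
    """Total assembly size plus the number of contigs longer than min_len."""
    lengths = [n for _, n in fasta_lengths(lines)]
    long_ones = [n for n in lengths if n > min_len]
    return {
        "total_bp": sum(lengths),
        "n_contigs": len(lengths),
        "n_contigs_over_min": len(long_ones),
        "bp_in_contigs_over_min": sum(long_ones),
    }


# Usage (hypothetical file name):
#     with open("method1.fasta") as fh:
#         print(summarize(fh))
```

Comparing `bp_in_contigs_over_min` with `total_bp` for both assemblies shows whether the size difference sits in the long contigs or in a tail of short ones.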

I suspected duplication in Method 1 and checked completeness with BUSCO. Surprisingly, both methods gave similar completeness values, with no sign of duplicated (diploid) content in Method 1. So I am very curious to know where the extra 400 Mb comes from. To find out, I am trying to align the two assemblies and visualize the result, but because of the large file sizes almost every tool fails. For instance, I tried:

  1. minidot: fails at installation, even after repeated attempts

  2. LASTZ alignment -> MAF -> AliTV: fails at the alignment step itself

  3. MUMmer/nucmer: fails because the sequence length exceeds the allowed limit (even though I am using the 64-bit version)

  4. LAST: generates a MAF file of around 300 GB, which no downstream application can read

  5. Gepard: hangs

I feel like I have hit a dead end. Kindly let me know how to handle this situation. I am very curious to know where these extra sequences come from!

Thanks in advance.


Your hunch is probably not far off: the difference is likely due to redundancy in Method 1.

BUSCO might not show this because it only looks at the genic part of the assembly; the redundancy might well be in gene-poor (or even gene-less) regions.

Which version of MUMmer are you trying to run?

Why not give good old BLAST a try? If it is simply to get a first idea, blastn output will give you that as well.


I am using MUMmer 3.23. As @gconception mentioned, I will try version 4. How would blastn help?


Well, you could quickly "align" the sequences to each other (BLAST set 1 against set 2, and/or vice versa) and see whether a single query gets multiple (two?) hits in the other set.
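To make the multiple-hits check concrete, here is a rough Python sketch. It assumes blastn was run with `-outfmt 6` (tab-separated, default column order: qseqid, sseqid, pident, length, ...). The identity and length thresholds and the function names are arbitrary choices for illustration, not recommended values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def subjects_per_query(
    lines: Iterable[str],
    min_identity: float = 95.0,  # assumed cutoff, tune for your data
    min_aln_len: int = 5000,     # assumed cutoff, tune for your data
) -> Dict[str, Set[str]]:
    """Collect the distinct subject contigs hit by each query contig."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, aln_len = float(fields[2]), int(fields[3])
        if pident >= min_identity and aln_len >= min_aln_len:
            hits[qseqid].add(sseqid)
    return dict(hits)


def multi_hit_queries(
    hits: Dict[str, Set[str]], min_subjects: int = 2
) -> Dict[str, List[str]]:
    """Keep only the queries that hit at least min_subjects distinct contigs."""
    return {q: sorted(s) for q, s in hits.items() if len(s) >= min_subjects}


# Usage (hypothetical file name):
#     with open("method2_vs_method1.tsv") as fh:
#         print(multi_hit_queries(subjects_per_query(fh)))
```

If Method 1 is the redundant assembly, querying with the Method 2 contigs against Method 1 should produce many queries that hit two or more distinct Method 1 contigs. Note that a long query can also span several shorter subject contigs laid end to end, so the flagged queries are candidates to inspect rather than proof of duplication.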


If you want to use MUMmer, make sure you are using version 4.0.0: https://github.com/mummer4/mummer/releases
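Once nucmer runs to completion, one way to look for the extra sequence without a dotplot is to ask how much of each Method 1 contig is covered by alignments to Method 2. The sketch below is an assumption on my part rather than an established recipe: it expects tab-delimited `show-coords -T` output from an alignment with Method 1 as the reference, reads only the first two columns (reference start and end) and the last two columns (reference and query names), and uses illustrative function names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def merged_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Bases covered by a set of 1-based inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the current merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def covered_bp_per_reference(lines: Iterable[str]) -> Dict[str, int]:
    """Aligned bases per reference contig, counting overlaps only once."""
    per_ref: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # header lines or blank lines
        try:
            s1, e1 = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # column-title line such as "[S1] [E1] ..."
        ref_name = fields[-2]  # assumed: reference tag is second to last
        per_ref[ref_name].append((min(s1, e1), max(s1, e1)))
    return {ref: merged_length(iv) for ref, iv in per_ref.items()}


# Usage (hypothetical file name):
#     with open("m1_vs_m2.coords") as fh:
#         covered = covered_bp_per_reference(fh)
```

Contigs that are long but have little or no covered sequence are candidates for the part of Method 1 that is missing from Method 2. Contigs that never appear in the output have no alignment at all, so they need to be picked up separately from the contig list.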

D-GENIES is another dotplot option that works well for large genomes: http://dgenies.toulouse.inra.fr/

Are these assemblies from long reads? What assemblers were used? FALCON & Canu?


I used MUMmer 3.23; I will give version 4 a try. By the way, the D-GENIES web version failed, and a local installation is not user-friendly. The assemblies are from long reads and were assembled with Canu.
