Question

Larger genome size than expected - please help

8 days ago

alexandrakortsi • 0

I have sequenced a fungal genome using nanopore technology. I have tried different assemblers and parameters (NECAT, Flye, etc.), and all of the final assemblies are larger than expected for this species. The expected genome size is 32-34 Mbp and my assemblies are around 39 Mbp. So far:

BUSCO results show a 97% score within the fungal order, which is pretty good, and only 24 duplicates.
Mapping the raw reads to the assembly showed a 99.88% mapping rate; around 385,000 bp fell below the low-coverage threshold, while 140,000 bp exceeded the high-coverage threshold.
Mapping to a reference from another strain showed 100% mapped reads, BUT 5411 supplementary alignments, 311 non-primary alignments and a 6.8% error rate. CIGAR alignments without clipping totaled 34 Mbp.

Could someone please help me understand whether these data indicate a problem in my assembly? How can I find out if these extra million bp are real (seems unlikely), and how can I correct any mistakes?

genome nanopore size minimap2 • 815 views

ADD COMMENT • link 3 days ago by alexandrakortsi • 0

if these extra million bp are real (seems unlikely)

Why do you feel that way? Have you aligned your assembly to an available reference and checked where these "extra" sequences are located? Are they in contiguous chunks? If so, they may be really present in your genome.

ADD REPLY • link 8 days ago by GenoMax 150k

Because it is a very common species and there are 15 available genomes around the size I mentioned. The problem is that they all have many contigs (over 300), so I don't know how I can manually see where the differences are. Checking 300 contigs manually in IGV seems a little complicated. Do you know any other possible ways to do that?

ADD REPLY • link 8 days ago by alexandrakortsi • 0

The problem is that they all have many contigs (over 300)

That may indicate that they are not of "complete/reference" quality. How many contigs did you end up with? Is that more in line with the expected number of chromosomes?

ADD REPLY • link 8 days ago by GenoMax 150k

score 1 · Answer 1 · 2025-04-02

Genome assembly is still a trial-and-error process, but getting a 39 Mbp assembly when you expected about 34 Mbp is not too bad.

Often, reference genomes don't contain all the information, especially for regions that are hard to assemble, such as repeats or low-complexity regions.

Consider how long it took to produce an accurate human genome: about twenty years after the completion of the Human Genome Project was announced.

Sometimes there are contaminants of various kinds in the data that may also get assembled into extraneous contigs.

I recommend generating dot plots of your assembly against the existing genomes to determine how they differ. Dot plots may be able to tell you whether the extra regions lie within existing chromosomes:

https://github.com/MariaNattestad/dot
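If a dot plot for dozens of contigs is hard to read, the same alignment can also be summarized as a table. Below is a minimal sketch, not a definitive method: it assumes you aligned your assembly (query) to one of the existing genomes (target) with something like `minimap2 -x asm10 reference.fa assembly.fa > aln.paf`, and it lists stretches of each contig that no alignment covers. The function name and the `min_gap` threshold are made up for illustration. Contigs with no alignment at all never appear in a PAF file, so check for those separately.

```python
from collections import defaultdict


def uncovered_regions(paf_lines, min_gap=10_000):
    """Return {contig: [(start, end), ...]} for query stretches with no alignment.

    Uses PAF columns 1-4: query name, query length, query start, query end
    (0-based, end-exclusive). min_gap is an arbitrary illustrative cutoff.
    """
    lengths = {}
    intervals = defaultdict(list)
    for line in paf_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        name, qlen = fields[0], int(fields[1])
        lengths[name] = qlen
        intervals[name].append((int(fields[2]), int(fields[3])))

    gaps = {}
    for name, qlen in lengths.items():
        covered_to = 0  # right edge of everything covered so far
        found = []
        for start, end in sorted(intervals[name]):
            if start - covered_to >= min_gap:
                found.append((covered_to, start))
            covered_to = max(covered_to, end)
        if qlen - covered_to >= min_gap:
            found.append((covered_to, qlen))
        if found:
            gaps[name] = found
    return gaps


if __name__ == "__main__":
    # Toy PAF records; real input would be open("aln.paf").
    example = [
        "ctg1\t100000\t0\t40000\t+\tref1\t90000\t0\t40000\t39000\t40000\t60",
        "ctg1\t100000\t70000\t100000\t+\tref1\t90000\t50000\t80000\t29000\t30000\t60",
    ]
    for contig, regions in uncovered_regions(example).items():
        print(contig, regions)
```

Regions reported this way are only candidates for "extra" sequence. Whether they are repeats, strain differences, or contamination still has to be checked against the reads.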

score 1 · Answer 2 · 2025-04-03

7 days ago

colindaven 7.4k

Another possibility is that the other assemblies are older. Nanopore sequencing has become very useful of late and generates good, long assemblies, especially since the change to R10.4.1 data.

I would argue that your assemblies are likely better, and likely just contain more repeat sequences which were collapsed in older assemblies.

Regarding contamination, you can check the reads with Kraken, Centrifuge, etc. If you see problems, try to exclude those reads and reassemble.

To check the final assembly (if needed!), I would try the sourmash tax subcommands. Or you can call genes, e.g. with Helixer for fungi or with AUGUSTUS, and then run blastx with the CDS sequences against protein databases.

ADD COMMENT • link 7 days ago by colindaven 7.4k

Thank you very much for your response. There is a chance that what you are saying is correct, since every assembly method I've tried produces the same result. The problem is, if uncollapsed repeats are the reason for the bigger size, how do I prove it? I mean, how do I prove that they are actual repeats present in the genome and not an assembly error?

ADD REPLY • link 7 days ago by alexandrakortsi • 0

There is nothing to prove per se. If you have multiple long reads in your data (the advantage of nanopore over short reads) that actually span the entire assembled repeat(s), then they are real. You have an assembly in hand (and admittedly you got to that point multiple ways), so you can submit it to GenBank/ENA along with the raw data.

You did not say how many contigs your assembly has and whether the number is closer to the expected number of chromosomes based on ploidy. Based on what you have been saying, it must be better than the 300 contigs that existing assemblies have.

ADD REPLY • link 7 days ago by GenoMax 150k

The critical thing to remember is that there will be information that cannot be "proven" with your data because the read length imposes inherent limitations.

Different assemblies may be fully compatible with the information in the reads. Longer reads indicate a haplotype, but until we get to a length where a read contains a full genome, there can always be uncertainties.

What you can do is support the existing pieces of evidence.

For example, once you make a dot plot (or via some other methods), you may find specific extraneous sequences relative to the reference.

Now, when aligning reads to these regions, if you observe reads that start in the reference and continue into the newly inserted region, then you have evidence that the DNA in your sample supports that change. That's all the proof you need.
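As a rough illustration of that check, here is a minimal sketch that counts reads whose alignment covers a candidate region plus some flanking sequence on both sides. It assumes the reads were mapped to your assembly and exported as SAM text (for example with `samtools view -h`). The function names and the `flank` value are invented for the example, and the flank size is arbitrary.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance along the reference (per the SAM specification).
REF_CONSUMING = set("MDN=X")


def reference_span(pos, cigar):
    """Return the 0-based, end-exclusive reference interval of one alignment."""
    length = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1  # SAM POS is 1-based
    return start, start + length


def count_spanning_reads(sam_lines, contig, start, end, flank=1000):
    """Count alignments covering [start - flank, end + flank) on `contig`.

    Unmapped (0x4) and secondary (0x100) records are skipped.
    """
    count = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if rname != contig or flag & 0x4 or flag & 0x100 or cigar == "*":
            continue
        aln_start, aln_end = reference_span(pos, cigar)
        if aln_start <= start - flank and aln_end >= end + flank:
            count += 1
    return count


if __name__ == "__main__":
    # Toy records; real input would be the lines of a SAM file.
    sam = [
        "r1\t0\tctg1\t1001\t60\t5000M\t*\t0\t0\t*\t*",
        "r2\t0\tctg1\t2001\t60\t3000M\t*\t0\t0\t*\t*",
    ]
    print(count_spanning_reads(sam, "ctg1", 2500, 4000))
```

Several independent reads spanning both junctions of an insertion are the kind of support described above. A region that no single read spans remains uncertain.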

For what it is worth, assembly is still a black box, and it is challenging to understand and validate the internal decision-making that goes into the process.

ADD REPLY • link 7 days ago by Istvan Albert 102k

You can map short or long reads to both assemblies and compare coverage. Let's say a repeat or gene is present 3x in reality and 3x in your modern assembly. The old assembly has only one copy. You would then expect a coverage peak of 3x the average coverage at that single copy in the old assembly. Of course, real data is never as clear as this example.
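To make the 3x example concrete, here is a small sketch that bins per-base depth into windows and flags windows whose mean depth is far from the typical value. It assumes three-column input (contig, 1-based position, depth) such as `samtools depth -a` produces. The window size and the 0.5x / 2x ratios are arbitrary choices for illustration, not recommended thresholds.

```python
from collections import defaultdict
from statistics import median


def window_means(depth_lines, window=1000):
    """Return {(contig, window_index): mean depth} from contig/pos/depth lines."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        if not line.strip():
            continue
        contig, pos, depth = line.split()[:3]
        key = (contig, (int(pos) - 1) // window)  # positions are 1-based
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def flag_windows(means, low_ratio=0.5, high_ratio=2.0):
    """Return {(contig, window_index): depth / median depth} for outlier windows."""
    typical = median(means.values())
    if typical == 0:
        return {}
    flagged = {}
    for key, mean in sorted(means.items()):
        ratio = mean / typical
        if ratio <= low_ratio or ratio >= high_ratio:
            flagged[key] = round(ratio, 2)
    return flagged


if __name__ == "__main__":
    # Toy data: one 10 bp window at 3x the depth of its neighbours.
    lines = [f"ctg1\t{p}\t{90 if 21 <= p <= 30 else 30}" for p in range(1, 41)]
    print(flag_windows(window_means(lines, window=10)))
```

Run against the old assembly, a window near 3x would be consistent with a collapsed three-copy repeat. Run against your own assembly, windows near 0.5x or lower might instead point at haplotype duplicates or contamination.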

I find SVIM-asm with BED output to be a very effective tool for pairwise comparison of genome assemblies. Dot plots work too, as others have mentioned, but I prefer looking at BED files with genes in a genome browser. You can use the online version of Helixer to easily annotate your genomes.

ADD REPLY • link 6 days ago by colindaven 7.4k

Thank you all for the advice!! My assembly has 66 contigs. I found someone who had a similar situation, and his advice was to map reads to the assembly and locate supplementary alignments and high-coverage regions. If there are overlapping contigs that have both many supplementary alignments and high-coverage regions, then they are most likely suspicious for duplications/misassemblies. Is this a good strategy to follow besides your advice?
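For reference, the supplementary-alignment part of that strategy could be sketched roughly as below. It assumes SAM text with a header (so contig lengths can be read from the `@SQ` lines) and normalizes the count of supplementary records (flag 0x800) per 100 kb of contig. The function name and the per-100-kb normalization are illustrative choices, not an established metric.

```python
from collections import Counter


def supplementary_density(sam_lines):
    """Return {contig: supplementary alignments per 100 kb of contig length}.

    Contig lengths come from @SQ header lines (SN and LN tags).
    """
    lengths = {}
    counts = Counter()
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@SQ"):
            tags = dict(tag.split(":", 1) for tag in fields[1:])
            lengths[tags["SN"]] = int(tags["LN"])
        elif not line.startswith("@"):
            flag, rname = int(fields[1]), fields[2]
            # 0x800 = supplementary alignment, 0x4 = unmapped
            if flag & 0x800 and not flag & 0x4:
                counts[rname] += 1
    return {name: counts[name] * 100_000 / length for name, length in lengths.items()}


if __name__ == "__main__":
    # Toy records; real input would be the lines of a SAM file with its header.
    sam = [
        "@SQ\tSN:ctg1\tLN:200000",
        "@SQ\tSN:ctg2\tLN:50000",
        "r1\t0\tctg1\t1\t60\t100M\t*\t0\t0\t*\t*",
        "r1\t2048\tctg2\t1\t60\t50M50H\t*\t0\t0\t*\t*",
    ]
    ranked = sorted(supplementary_density(sam).items(), key=lambda kv: -kv[1])
    print(ranked)
```

Contigs near the top of this ranking that also show unusual coverage would be the first ones to inspect by eye.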