Question

MUMmer alignments: Trying to understand the average identity result

0

5.0 years ago

caro-ca ▴ 20

Hi, community! I am de novo assembling Nanopore long reads and comparing my draft genome assembly against the reference genome available online. First, some details about the inputs. The reference genome was built with a hybrid method that combined Illumina and PacBio data; they assembled the short reads and used the long reads (~30x coverage) for gap filling. In my assembly I could find complete chromosomes, but none in the reference genome. The annotation of the reference genome contains more protein-coding genes than the annotation of my own assembly. Both assemblies are from the same species but from different strains, and they differ in genome size.

When I ran nucmer (from MUMmer) with the -maxmatch option, followed by delta-filter, I got an average identity of 93%. How is this possible? I find it difficult to understand because:

1) My assembly had ~47x coverage, whereas according to the literature I would need 70x coverage to overcome the systematic errors of Nanopore. So my assembly still has errors, even though I did a lot of error-correction, consensus and polishing steps.

2) With the Nanopore assembly, I could span more repetitive regions than the reference genome does.

So, in general, there are several reasons why I would have expected a somewhat lower identity percentage.

Let me know if I made myself clear. Thank you in advance for your help.

MUMmer de novo assembly average identity • 1.9k views

ADD COMMENT • link updated 5.0 years ago by Istvan Albert 102k • written 5.0 years ago by caro-ca ▴ 20

1

My assembly had ~47x coverage which according to the literature I needed 70x coverage to overcome the systematic errors of Nanopore

It's important to realize how fast the field changes. Nanopore errors have become a lot less problematic than some older literature might lead you to think. I don't know which paper you are referring to, but keep in mind that the technology was of lower quality when those authors generated their data.

ADD REPLY • link 5.0 years ago by WouterDeCoster 47k

0

Hi! Thank you for your answer. Actually, that coverage value is what I am struggling with. What counts as low or high coverage for overcoming systematic Nanopore errors? Could you suggest a paper that clarifies this? The one I read is from 2015.
Thank you!

ADD REPLY • link 4.9 years ago by caro-ca ▴ 20

score 2 · Accepted Answer · 2019-11-15

The question is not that well-formed - that is one reason you are not getting more answers. The most obvious explanation is quite straightforward, but most likely will not directly provide you with the information you seek:

You get a 93% average identity because that is how many bases are identical, on average.

You are also saying: I would have expected a little bit less of the identity percentage.

First, I would say that it is quite challenging to accurately predict how similar two strains are. An average identity of 93% does not mean the sequences are 93% identical everywhere. There will be long regions of identical sequences punctuated by shorter regions of great diversity - that is what closely related sequences should look like.
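To make that point concrete, here is a minimal sketch in Python. It assumes the average is weighted by aligned length, which may not be exactly how your tool reports it, and the alignment blocks are invented numbers, not values from the poster's data.

```python
# Hypothetical alignment blocks: (aligned length in bp, percent identity).
# The values are made up purely to illustrate the arithmetic.
blocks = [
    (600_000, 99.0),  # long, nearly identical region
    (200_000, 97.0),  # long, very similar region
    (150_000, 75.0),  # shorter, divergent region
    (50_000, 59.0),   # short, highly divergent region
]


def weighted_identity(blocks):
    """Length-weighted mean percent identity over all blocks."""
    total = sum(length for length, _ in blocks)
    return sum(length * ident for length, ident in blocks) / total


def fraction_above(blocks, threshold):
    """Fraction of aligned bases in blocks with identity >= threshold."""
    total = sum(length for length, _ in blocks)
    kept = sum(length for length, ident in blocks if ident >= threshold)
    return kept / total


print(f"average identity: {weighted_identity(blocks):.1f}%")
print(f"aligned bases at >= 97% identity: {fraction_above(blocks, 97.0):.0%}")
```

In this toy case the average comes out at 93% even though 80% of the aligned bases sit in blocks that are at least 97% identical; a few divergent blocks pull the mean down.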

A 7% difference could be a lot, or very little - it all depends on how densely packed the information in the genomes is. My gut feeling is that a 7% difference across strains is already a little too much.

If you suspect your alignments to be the culprit, then look at and evaluate the alignments themselves. One of the easiest ways to compare two similar sequences is to use the minimap2 aligner. Basically, align the draft genome to the reference, then visualize the resulting BAM file in IGV. It can be very informative.
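As a rough sketch of that workflow, the Python helper below only builds the command lines so you can inspect them before running anything. The file names, the output prefix and the `asm20` preset are assumptions made for illustration (minimap2 also ships `asm5` and `asm10` presets for less divergent assemblies); the sketch also assumes minimap2 and samtools are installed.

```python
import shlex


def alignment_commands(reference, draft, prefix="draft_vs_ref", preset="asm20"):
    """Build shell commands that align a draft assembly to a reference.

    The names and the default preset are illustrative assumptions.
    Nothing is executed here; the commands are returned as strings.
    """
    ref = shlex.quote(reference)
    qry = shlex.quote(draft)
    bam = shlex.quote(prefix + ".bam")
    return [
        # -a writes SAM output, -x selects a preset; the result is sorted into a BAM file.
        f"minimap2 -ax {preset} {ref} {qry} | samtools sort -o {bam} -",
        # IGV needs an index next to the sorted BAM file.
        f"samtools index {bam}",
    ]


for cmd in alignment_commands("reference.fa", "draft.fa"):
    print(cmd)
```

Load the reference FASTA as the genome in IGV, then open the sorted and indexed BAM file to browse where the draft assembly agrees with the reference and where it does not.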