How do long reads help in the repeated regions of the genome?
2
0
4.2 years ago
Ashi ▴ 20

Hi,

I am doing some work with long reads, and I have read in many review papers about the advantages and disadvantages of long reads compared with short reads. One of the most important advantages of long reads that I have come across is that they help in the assembly of highly repeated regions, whereas short reads fail to do that.

Is there an example or an article that would help me understand (basically visualize) how long reads help in these repeated regions?

Thank you so much for all the help.

Illumina short-reads long-reads ONT • 1.6k views
4
4.2 years ago
Mensur Dlakic ★ 28k

When individual reads are shorter than the repeats, and especially when the repeats are highly similar, it is not possible to map the reads unambiguously onto the genome. That in turn means it is not possible to determine the number of repeats unambiguously.

You can do this as an exercise: draw a schematic of 5 repeats and place 50 random reads, each shorter than a repeat, across that region. Let's say that this results in 8x coverage. If you did that exercise for a non-repetitive region, you'd be able to reconstruct the whole sequence without much trouble. The problem with repeats is that read #1 may come from repeat #1, but you'll be able to map it to any of the 5 repeats if they are similar enough. Even reads that span two repeats will map to more than one part of the genome, again assuming that the repeats and their intervening sequences are similar enough. This usually leads to shrinkage of the repetitive region, so the assembly ends up with fewer than 5 repeats.
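The exercise above can also be run as a small simulation. This is only a toy sketch under simplifying assumptions that are not in the original answer: the 5 repeats are identical and sit back to back (no intervening sequence), reads are error-free, and "mapping" means exact string matching. All names and lengths (`unit`, `collapsed`, 200 bp repeats, 50 bp reads) are made up for illustration.

```python
import random

def random_dna(n, rng):
    """Return a random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))

def find_all(text, query):
    """Return every start position where query occurs exactly in text."""
    hits, start = [], text.find(query)
    while start != -1:
        hits.append(start)
        start = text.find(query, start + 1)
    return hits

rng = random.Random(0)               # fixed seed so the toy is reproducible
unit = random_dna(200, rng)          # one repeat unit (length is an assumption)
left = random_dna(300, rng)          # unique flanking sequence
right = random_dna(300, rng)
genome = left + unit * 5 + right     # "true" genome with 5 copies
collapsed = left + unit * 3 + right  # a wrong assembly with only 3 copies

read_len = 50                        # shorter than the 200 bp repeat unit
array_start = len(left)
array_end = array_start + 5 * len(unit)

# Sample 50 error-free reads that lie entirely inside the repeat array.
reads = []
for _ in range(50):
    pos = rng.randrange(array_start, array_end - read_len + 1)
    reads.append(genome[pos:pos + read_len])

# A read is ambiguous if it matches more than one position in the genome.
multi_mapped = sum(len(find_all(genome, r)) > 1 for r in reads)
# A read "fits" the collapsed assembly if it occurs there exactly.
fit_collapsed = sum(r in collapsed for r in reads)

print(f"reads with more than one placement: {multi_mapped}/{len(reads)}")
print(f"reads that also fit the 3-copy assembly: {fit_collapsed}/{len(reads)}")
```

In this toy every sampled read has several equally good placements, and every read is just as consistent with the 3-copy assembly as with the true 5-copy genome, which is why the short reads alone cannot rule out the shrunken version.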

The problem is not limited to individual genomic regions. If there are repeats in two physically separate parts of the genome, the assembler will entangle them when resolving the graph nodes. That will lead to the collapse of both contigs, and the repeats may be assigned to either of them or even end up in a separate contig. That's illustrated below: red and cyan are different contigs that share a repeated part in the middle.

[Image: two contigs, red and cyan, sharing a repeated region in the middle]
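The same situation can be sketched in code. This is a minimal toy, assuming a de Bruijn-style k-mer graph (real assemblers differ, and some use overlap graphs instead), error-free sequence, and made-up contigs `red` and `cyan` that share one `repeat`. The value `k = 21` and all lengths are arbitrary choices for the illustration.

```python
import random
from collections import defaultdict

def random_dna(n, rng):
    """Return a random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))

def kmers(seq, k):
    """Return all k-mers of seq in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

rng = random.Random(2)
k = 21
repeat = random_dna(60, rng)         # the shared repeated part

# Two separate regions sharing the repeat. The bases touching the repeat
# are forced to differ so the paths split exactly at its edges.
red = random_dna(40, rng) + "G" + repeat + "A" + random_dna(40, rng)
cyan = random_dna(40, rng) + "T" + repeat + "C" + random_dna(40, rng)

# Build a k-mer graph: an edge joins each k-mer to the next one in a contig.
successors = defaultdict(set)
predecessors = defaultdict(set)
for contig in (red, cyan):
    km = kmers(contig, k)
    for a, b in zip(km, km[1:]):
        successors[a].add(b)
        predecessors[b].add(a)

# Nodes where the graph forks: two ways in, or two ways out.
fork_in = [n for n, prev in predecessors.items() if len(prev) > 1]
fork_out = [n for n, nxt in successors.items() if len(nxt) > 1]

ways_in = len(predecessors[fork_in[0]])
ways_out = len(successors[fork_out[0]])
print("fork at start of repeat:", fork_in[0] == repeat[:k])
print("fork at end of repeat:", fork_out[0] == repeat[-k:])
print("possible walks through the repeat:", ways_in * ways_out)
```

The graph has two ways into the repeat and two ways out, so four walks are possible while only two are real (red-to-red and cyan-to-cyan). Without extra information an assembler cannot tell which entry pairs with which exit, which is one way the repeat can end up attached to the wrong contig or split off on its own.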

Google "contig collapse repeat" and there should be plenty of material that will hopefully explain it to your liking.

Long reads overcome this problem if they are long enough to span the whole repeated region. If a long read starts in a unique part of the genome, goes through the repeat, and ends in another unique part, there is nothing ambiguous about how that read overlaps with others, because the whole repeated region is contained within that one read. Even when long reads can't span the whole repeated region, they are usually better at resolving repeats as long as they cover several repeat copies that carry one or two unique mutations. That greatly reduces the number of possible overlaps compared with short reads.
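A toy version of the spanning case, under assumptions that are not in the original answer: 5 identical back-to-back repeat copies, an error-free long read, and exact string matching in place of a real aligner. The names (`long_read`, `collapsed`, `anchor`) and all lengths are hypothetical.

```python
import random

def random_dna(n, rng):
    """Return a random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))

rng = random.Random(1)
unit = random_dna(200, rng)          # one repeat unit
left = random_dna(300, rng)          # unique sequence before the repeats
right = random_dna(300, rng)         # unique sequence after the repeats
genome = left + unit * 5 + right     # "true" genome with 5 copies
collapsed = left + unit * 3 + right  # wrong assembly with only 3 copies

# One long read: 50 bp of unique sequence on each side plus the whole array.
anchor = 50
start = len(left) - anchor
end = len(left) + 5 * len(unit) + anchor
long_read = genome[start:end]

placements = genome.count(long_read)     # exact placements in the genome
copies_in_read = long_read.count(unit)   # repeat copies seen in one read
fits_collapsed = long_read in collapsed  # is a 3-copy assembly compatible?

print("placements in the true genome:", placements)
print("repeat copies contained in the read:", copies_in_read)
print("compatible with the 3-copy assembly:", fits_collapsed)
```

Because the read is anchored in unique sequence on both sides, it has a single placement, it carries the copy number directly, and it contradicts the shrunken 3-copy version. Real long reads contain errors, so in practice this comparison is done by alignment and consensus, not exact matching.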

0

Thank you for the explanation. It is really helpful.

2
4.2 years ago

In short, long reads are more likely to contain the entire repeated region, so the region can be read as a whole instead of being reconstructed from short reads that lack anchoring points on both sides of the problematic region.

Short reads coming from a repeated region map equally well to several places in that region, so aligners typically give them a mapping quality of 0 (or close to it), and most downstream analyses filter them out.
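To see why the mapping quality drops so low, recall that the SAM format defines MAPQ as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer. The sketch below is an idealized calculation, not what any particular aligner does: it assumes a read has `n_equal_hits` equally good placements and that one is picked at random. The function name and the `cap` of 60 are assumptions for illustration.

```python
import math

def idealized_mapq(n_equal_hits, cap=60):
    """Phred-scaled mapping quality for a read with n equally good placements.

    Assumes one placement is chosen uniformly at random, so the chance that
    the reported position is wrong is 1 - 1/n. The cap is a hypothetical
    upper bound, used here only for the unique (n = 1) case.
    """
    if n_equal_hits < 1:
        raise ValueError("a mapped read needs at least one placement")
    p_wrong = 1.0 - 1.0 / n_equal_hits
    if p_wrong == 0.0:
        return cap  # unique placement: no ambiguity in this toy model
    return min(cap, round(-10.0 * math.log10(p_wrong)))

for n in (1, 2, 5, 10):
    print(n, "equally good placements -> MAPQ", idealized_mapq(n))
```

Even two equally good placements already push the idealized value into the low single digits. Aligners compute MAPQ with their own heuristics and many simply report 0 for such reads, so typical filters (for example, keeping only reads above some MAPQ threshold) discard them.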

0

Thank you for this simple explanation. I get the full picture now.

