What Read Depth is Needed to Find Insertions?

Hello,

According to a review by D. Sims et al. [1], the required average mapped depth for detecting heterozygous SNVs and indels in resequencing studies is 35×. I wonder why it is so high. Is this only due to the non-uniformity of read depth? And would the requirement be even higher if one sets out to find large insertions?
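
For intuition about the non-uniformity I mean: under an idealized Poisson model of coverage (real data are over-dispersed by GC bias, mappability, etc.), one can sketch the fraction of sites that fall below a per-site minimum. The 15x cut-off below is just an illustration, not a number from the review:

    from math import exp, factorial

    def frac_below(mean_cov, k):
        """Fraction of sites with depth < k under a Poisson(mean_cov) model."""
        return sum(exp(-mean_cov) * mean_cov**d / factorial(d) for d in range(k))

    # 15x is an illustrative per-site minimum for confident het calls,
    # not a threshold taken from the review.
    for mean in (20, 35, 50):
        print(f"mean {mean}x: {frac_below(mean, 15):.4%} of sites under 15x")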


[1] Sims, D., Sudbery, I., Ilott, N. E., Heger, A., & Ponting, C. P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15, 121-132.

Tagging i.sudbery

It's very easy to discover short features (substantially shorter than the read length, like a 20 bp indel in 150 bp reads) with very low-depth sequencing (say, 3x) with some degree of certainty. In that case, more coverage simply buys more confidence. Heterozygous features in a diploid require more coverage than homozygous features, of course. But the coverage needed is a function of read length, error rate, sequencing methodology (paired or unpaired? insert size distribution?), what you're looking for (novel or well-known variants?), genome type (complexity and repetitiveness), biases present in the sequencing, and so forth. I have not read the paper, but any claim of a specific level of coverage required needs to specify these factors along with the desired level of confidence.

I've called short heterozygous variants at low (<10x) coverage using short, high-error-rate, highly biased 50 bp colorspace reads. The confidence level can be quite good even in that very undesirable case (>95% accuracy, as determined by sequencing trios). Generally, though, I don't recommend it. Conversely, with high-quality 2x250 bp reads from a methodology that incurs little bias, you could get extremely high-confidence het calls with coverage substantially lower than 35x.
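
As a toy illustration of that trade-off (a simple binomial model, not the paper's calculation; the >= 3 alt-read threshold is arbitrary), here is how depth and per-base error rate interact:

    from math import comb

    def binom_tail(n, p, k):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    # Detection: a true het site shows >= 3 alt reads (alt sampled at ~50%).
    # False call: errors alone yield >= 3 alt reads at a hom-ref site.
    for depth in (5, 10, 20, 35):
        detect = binom_tail(depth, 0.5, 3)
        for err in (0.01, 0.001):  # per-base miscall rates
            fp = binom_tail(depth, err, 3)
            print(f"{depth:2d}x err={err}: detect {detect:.3f}, false-alt {fp:.2e}")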

There is a difference, though, between claiming that a single variant is a high-confidence real heterozygous variant (which can be done at 10x) and claiming that you have, with high confidence, found virtually all real heterozygous variants and correctly classified them as heterozygous. The latter can't be done at 10x regardless of your methods (for bulk-DNA shotgun sequencing) because, in the best case, there is still a 1/512 chance that all reads covering a given location came from the same chromosome copy (that's 2 × (1/2)^10 = 1/512).
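
The same arithmetic across depths (the ~2 million het sites per human genome used for scale is an order-of-magnitude assumption):

    # Chance that all d reads covering a diploid site come from one haplotype:
    # 2 * (1/2)**d, since each read picks a copy independently and either
    # copy can be the one that "wins".
    for d in (5, 10, 15, 20):
        p_one_hap = 2 * 0.5**d
        # ~2 million het sites per human genome is an order-of-magnitude
        # assumption for scale, not a number from the thread.
        print(f"{d:2d}x: P = {p_one_hap:.2e} "
              f"(~{p_one_hap * 2e6:,.0f} of 2M hets sampled one-sided)")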

Features similar to or longer than the read length, such as long insertions and other structural variants, need to be inferred from coverage rather than called within a single read, though, so they are more coverage-dependent.

Thank you very much! What is your estimate: would you try to find insertions that are long, but where you only care about the ends, with short reads (50 bp or 75 bp SE)?

How long are the insertions?

Roughly 900 bp. I actually have a dataset of 75 bp mate-pair PE reads, but I am investigating different alignment approaches.

PE reads would be best for this kind of situation. The difference between 50 and 75 bp SE reads will be relatively small, but of course 75 bp is still better, since you get a longer anchor. I think finding the reads that map to the edge of the insertion and trying to extend them by ~1 kbp in the direction of the insertion using assembly techniques is the only way to recover the inserted sequence. If you just want the genomic coordinates of the insertion and its zygosity, 50 bp or 75 bp SE reads should be adequate, though I'm not sure what the proper tool for that situation is.

Edit: LMP reads might be useful if they have a fairly tight insert-size distribution in the realm of the 900 bp you are looking for, though. Am I correct in understanding that you have a long-mate-pair library? If so, what kind of insert-size distribution does it have?
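
In case it helps, a minimal sketch (assuming pysam and a coordinate-sorted, indexed BAM; file name, contig, and breakpoint position are placeholders) of the first step of the extension idea above, collecting soft-clipped read tails at a candidate breakpoint to seed the extension:

    import pysam

    BAM, CHROM, BP = "sample.bam", "chr1", 1234567  # placeholder inputs

    with pysam.AlignmentFile(BAM, "rb") as bam:
        for read in bam.fetch(CHROM, BP - 10, BP + 10):
            if read.is_unmapped or not read.cigartuples:
                continue
            op, length = read.cigartuples[-1]  # last CIGAR operation
            # op 4 == soft clip; a tail clipped right at the breakpoint
            # likely continues into the inserted sequence.
            if op == 4 and abs(read.reference_end - BP) <= 5:
                print(read.query_name, read.query_sequence[-length:])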

We decided on PE reads because we suspected that there are cases where additional fragments are inserted between the T-DNA (something in principle like a transposon) and the genome. This makes the border larger, in a way. The tool I used was Bowtie2, with a reference that included the T-DNA sequence as an additional chromosome; I then filtered for discordant pairs whose mates mapped to different chromosomes.
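
In pysam terms, that filter was roughly the following (file and contig names here are placeholders, not the real pipeline's):

    import pysam

    TDNA = "T-DNA"  # name of the extra contig in the reference (assumed)

    with pysam.AlignmentFile("aligned.bam", "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.mate_is_unmapped:
                continue
            # keep pairs with one end on the genome and the mate on the T-DNA
            if read.reference_name != TDNA and read.next_reference_name == TDNA:
                print(read.query_name, read.reference_name, read.reference_start)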

If you know what the inserted sequence is, then you can add it as a contig to your reference genome and your problem collapses to translocation detection.

More generally, PE approaches give a much stronger signal because (if your insertions are novel with respect to your reference) you can use read pairs in which only one side aligns as part of your signal. This expands the size of your signal from 75 bp to (fragment size - 75), which is much stronger than SE reads alone. Specialised callers such as NovelSeq are designed for this sort of scenario.
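
As a back-of-the-envelope illustration of that signal-size argument (the 400 bp fragment size is a made-up example; 75 bp is your read length):

    read_len, fragment = 75, 400  # 400 bp fragment size is a made-up example
    split_read_window = read_len        # SE: only reads crossing the junction
    oea_window = fragment - read_len    # PE: one-end-anchored pairs
    print(f"SE signal: ~{split_read_window} bp per breakpoint side")
    print(f"PE one-end-anchored signal: ~{oea_window} bp per breakpoint side")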
