Question

Error rate in sequencing?

7.1 years ago

bioplanet ▴ 60

Hello,

Suppose you have a file with, say, 2M reads from Illumina sequencing.

You also have a region of interest, and you want to know whether it was modified by Cas in a CRISPR-Cas experiment. If Cas is inactive, the region should stay intact; if Cas is active, the region will carry modifications, which include either nucleotide changes or small deletions.

My question is how I can estimate whether a variation that I might notice is actually caused by Cas or by a sequencing error.

My approach is, after mapping the reads to my reference construct, to take this region of interest and explore the variation across all the reads. I read that the Illumina sequencing error rate is conventionally taken to be ~0.1%. Does that mean that, for any given base in my 20-nt region of interest, I can expect 0.1% of the reads to show a mismatch compared to the expected nucleotide (i.e. the one in my reference construct)?

Is this a correct assumption to make? If not, how can I evaluate whether a variation I observe at a given position within these 20 nt comes from sequencing error or is actually a result of Cas modification?

Thank you!

sequencing next-gen • 1.9k views

updated 7.1 years ago by Kevin Blighe 88k • written 7.1 years ago by bioplanet ▴ 60

score 0 · Answer 1 · 2017-10-31

Your question appears to relate to both the inherent sequencing errors with Illumina sequencing (of which there are many) and also CRISPR.

I cannot comment specifically on the error rates of the Illumina protocol. However, the massively parallel nature of NGS is generally meant to overcome these inherent errors. The idea is that, by 'over' sequencing, we drown out erroneous base-calls, which typically appear in only a single read (or very few reads) in the entire sample library. On the other hand, if a genuine variant exists, it should be present in a statistically significant proportion of reads, such that a variant caller can easily identify it. This is why lowering QC thresholds in NGS is 'dangerous': it increases the false-positive variant call rate.
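To make "a statistically significant proportion of reads" concrete, here is a minimal sketch of a one-sided binomial test at a single position. It is an illustration only and rests on assumptions: the ~0.1% per-base error rate is the figure quoted in the question, every read is assumed to cover the position, errors are assumed independent between reads, and the function names and the observed counts are made up. In real data the error rate varies with base quality and position in the read, so an empirical rate from a control sample would be a better null.

```python
import math

def log_binom_pmf(i, n, p):
    """Log of P(X = i) for X ~ Binomial(n, p), computed via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        term = math.exp(log_binom_pmf(i, n, p))
        total += term
        # Past the mean the terms shrink monotonically; stop once negligible.
        if i > n * p and term <= total * 1e-17:
            break
    return min(total, 1.0)

# Assumed inputs (hypothetical, for illustration only).
depth = 2_000_000      # reads covering the position
error_rate = 0.001     # assumed per-base sequencing error rate (~0.1%)

expected_errors = depth * error_rate  # mismatching reads expected from error alone

for observed in (2000, 2100, 2500):   # made-up mismatch counts
    p_value = binom_upper_tail(observed, depth, error_rate)
    print(observed, expected_errors, p_value)
```

A small p-value only says that the mismatch count exceeds what the assumed error rate predicts; it does not say the excess was caused by Cas. Note also that the 0.1% figure covers a miscall to any of the three other bases, so the rate for one specific substitution is lower.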

In relation to CRISPR, in the ideal situation, you would have done targeted, phased sequencing of your region before and after the CRISPR experiment. In this way, you would have allele-specific information on the PAM site at which the expected cuts were to be made, and this would greatly assist in filtering out potential false single nucleotide variants and indels, i.e., genuine edits should only be identified on the same allele on which the PAM is located. Because you would also have before and after data, you could filter out germline SNPs and indels too.
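As a rough sketch of how before/after data could be used at one position, the snippet below compares the fraction of non-reference reads in the two samples with a one-sided two-proportion z-test. This is a swapped-in, generic statistical test rather than a method from this thread; the function name and all counts are hypothetical, and it ignores phasing entirely.

```python
import math

def two_proportion_z(alt_before, depth_before, alt_after, depth_after):
    """One-sided z-test: is the non-reference fraction higher after editing?

    Returns (z, p_value), where p_value = P(Z >= z) under the null
    hypothesis that both samples share the same non-reference fraction.
    """
    p_before = alt_before / depth_before
    p_after = alt_after / depth_after
    pooled = (alt_before + alt_after) / (depth_before + depth_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / depth_before + 1 / depth_after))
    if se == 0:
        return 0.0, 1.0  # no variation in either sample, nothing to test
    z = (p_after - p_before) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts at two positions (non-reference reads, total reads).
# Position A: background-level mismatches before, clear excess after.
print(two_proportion_z(100, 100_000, 5_000, 100_000))
# Position B: ~50% non-reference in both samples, as a heterozygous
# germline variant would look; no meaningful change after editing.
print(two_proportion_z(50_000, 100_000, 50_100, 100_000))
```

A position like B would be filtered as pre-existing, while a position like A is a candidate edit. With many positions tested, some multiple-testing correction would also be needed.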

From my experience, and I have been working on this for the past year, you will see indels of varying sizes at your PAM site, and some may even encompass the PAM site and apparently wipe it out completely. I am still attempting to understand the data myself.

If you are just looking at a single region, though, then just do multiple Sanger runs, as that will be more accurate and precise than NGS(?). NGS data is messy and requires a lot of QC!

Kevin