This post is a follow-up to the question asked by gerrybio2010, "How To Find Repeat Expansion Using Exome/Genome Sequencing Data?", on detecting short tandem repeats (STRs), trinucleotide repeats, and similar expansions in NGS data. Not everyone may find this interesting, but in the world of gene discovery in human neurologic disease, this is really important.
I've been talking about this with people much smarter than me for a while now, and I'm still unsure what the core issue is when it comes to detecting STRs and similar repeats with next-gen techniques, so I wanted to poll people's understanding here as well. Some say it's a sequence capture issue (tied to the amplification bias inherent in whole-exome capture protocols, which whole-genome sequencing ought to alleviate), and others say it's mainly an analysis issue, for the obvious read-length and repeat reasons. Interestingly, people in the different labs I've talked to don't agree on this.
So is the difficulty detecting short tandem repeats due to:
1) Issues with these regions during library prep (for whole exome) -- i.e. poor amplification of GC-rich repeats
2) Issues with mapping - can't align repeats uniquely, and the repeats may be longer than the read length
3) A swampy combination of both of the above
4) No one really knows; this is not a low-hanging-fruit kind of problem, and hopefully someone else will really start working on it.
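To make option 2 concrete, here is a rough Python sketch of the read-length arithmetic. It assumes a read can only size a repeat if it contains the whole repeat plus some unique flanking sequence on each side; the function names and the 10 bp default flank are my own illustrative choices, not taken from any particular tool.

```python
def max_spannable_units(read_len: int, motif_len: int, min_flank: int = 10) -> int:
    """Largest number of repeat units a single read can fully span while
    keeping min_flank bases of unique sequence on each side.
    min_flank is an assumed anchor size, not a published threshold."""
    usable = read_len - 2 * min_flank
    return max(usable // motif_len, 0)


def read_can_size_repeat(read_len: int, motif_len: int, n_units: int,
                         min_flank: int = 10) -> bool:
    """True if one read could contain the whole repeat plus both flanks."""
    return n_units <= max_spannable_units(read_len, motif_len, min_flank)


if __name__ == "__main__":
    # 100 bp reads, trinucleotide motif, 10 bp anchors on each side.
    print(max_spannable_units(100, 3))
    print(read_can_size_repeat(100, 3, 20))
    print(read_can_size_repeat(100, 3, 40))
```

Under these assumptions, any allele with more units than the first function returns cannot be sized from a single read, regardless of how well the region was captured.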
My impression from reading the literature is that short tandem repeat regions are by and large captured during library prep, and that the real issue is analysing the sequence data and identifying them. For example, see Kozlowski et al.
These blog posts are also interesting in this discussion:
http://www.cureffi.org/2012/12/27/how-to-identify-a-disease-associated-repeat-expansion/
http://www.cureffi.org/2013/01/08/calling-repeat-length-polymorphisms-with-lobstr/
So second question: Has anyone else used lobSTR on control data known to harbor expanded STRs? Were those regions detected?
And a final question: I like the idea of calling something by its absence in sequence data, perhaps to flag regions for follow-up studies in the wet lab. Given good overall read depth, it should be possible to detect regions where reads are missing, which may indicate an STR or an STR expansion. Does anyone have any experience with this sort of analysis of their data?
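A minimal Python sketch of this absence-based flagging, assuming you already have per-base depths for a region as a list of integers (for example, parsed from the output of `samtools depth -a`). The function name, the 50 bp window, and the 20% threshold are hypothetical choices for illustration, not a validated method.

```python
from statistics import median


def flag_depth_dropouts(depths, window=50, min_fraction=0.2):
    """Return (start, end) half-open index ranges of windows whose mean depth
    falls below min_fraction of the overall median depth.
    window and min_fraction are illustrative defaults only."""
    if not depths:
        return []
    threshold = median(depths) * min_fraction
    flagged = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        if sum(chunk) / len(chunk) < threshold:
            end = start + len(chunk)
            # Merge with the previous flagged window when they are adjacent.
            if flagged and flagged[-1][1] == start:
                flagged[-1] = (flagged[-1][0], end)
            else:
                flagged.append((start, end))
    return flagged


if __name__ == "__main__":
    # Toy region: good coverage, a 100 bp dropout, then good coverage again.
    toy = [40] * 100 + [2] * 100 + [40] * 100
    print(flag_depth_dropouts(toy))
```

Low depth can also come from poor capture or poor mappability, so flagged windows would only be candidates to intersect with known repeat annotations before any wet-lab follow-up.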
In my experience of analysing large viral genomes of 1.8 kb ;-) it has mainly been an issue with assembly. The problem has usually been that the di-nucleotide and tri-nucleotide repeat regions are longer than the read length, so the assembly programs, whether de novo or reference-based, do not know how to bridge the repeat region. We usually resort to designing primers on either side of the repeat and Sanger sequencing across it.
It's an interesting question, and my sense is that, in the case of exome-based detection, the capture itself can be an issue because the probes are designed based on the reference, AND, in the case of HD, the repeats are very GC-rich. If the capture is poor (as is suggested by the blogs you cite), the detection signals (similar to SV detection) will be poor as well. Are you aware of any published exome or genome datasets from patients with any of these disorders?