4 weeks ago
Joel Wallenius
Hello,
sequencing the GBA1 gene is important in hereditary Parkinson's disease research. Just upstream of GBA1 is a pseudogene GBAP1 with 90+ % sequence identity. This means WES capture kits include the pseudogene unless specifically designed to exclude it (not the case for us, unfortunately).
Illumina offers Gauchian to call variants in this region: https://github.com/Illumina/Gauchian
But it only works on WGS data.
Is anyone here aware of a tool or approach that helps with analysing this problematic region when working with WES data?
Thanks in advance!
If I understand correctly, you want to call variants?
In that case, you can use Mutect2: https://gatk.broadinstitute.org/hc/en-us/articles/360037593851-Mutect2
For CNVs, there is Control-FREEC: https://boevalab.inf.ethz.ch/FREEC/
or PureCN: https://www.bioconductor.org/packages/release/bioc/html/PureCN.html
Hello Odin, variants have been called already. This post is about the trouble with interpreting them: does a given read/variant come from the pseudogene or from the real gene?
Is this a single target capture or are there multiple targets? Have you tried to align your data to the region (including the pseudogene) (if single target) and/or to the entire genome (if multiple targets)?
Hello GenoMax!
We have WES data, so all canonical human exons are meant to be captured and sequenced (by using flanking primers, i.e. homologous regions like the pseudogene GBAP1 would be included by accident). I'm not sure what you're suggesting. Could you rephrase, please?
I was trying to see whether you had already done some analysis, which it looks like you have. It would be hard to distinguish between alignments to the pseudogene and the real gene unless a particular read aligns over a region where the two sequences differ. I assume you have short reads, so you can't capture the complete genomic context.
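To make the point about regions of sequence difference concrete, here is a minimal sketch in plain Python. It assumes you already have the gene and pseudogene sequences aligned to the same length (the toy strings below are made up, not real GBA1/GBAP1 sequence), and the function names are hypothetical. It only shows the idea: a read is informative if it covers at least one position where the two copies differ.

```python
def discriminating_positions(gene_seq, pseudo_seq):
    """Return 0-based indices where two pre-aligned, equal-length sequences differ."""
    if len(gene_seq) != len(pseudo_seq):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(gene_seq, pseudo_seq)) if a != b]


def read_is_informative(read_start, read_end, positions):
    """True if the half-open interval [read_start, read_end) covers a differing position."""
    return any(read_start <= p < read_end for p in positions)


# Toy, invented sequences purely for illustration.
gene = "ACGTACGTAC"
pseudo = "ACGTTCGTAA"
diffs = discriminating_positions(gene, pseudo)

# A read over positions 0-3 covers no difference; one over 2-5 covers position 4.
uninformative = read_is_informative(0, 4, diffs)
informative = read_is_informative(2, 6, diffs)
```

With real data the coordinates would come from an alignment of the two loci, and gaps would need handling; this sketch ignores indels entirely.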
Yes, sadly, we have 75 bp and 150 bp paired-end reads... I was hoping there would be a specific tool to "resolve" this particular problematic region. I saw a paper that compared the read depth between the two regions and basically said "if the read depths are sufficiently different at the locus of some detected variant, we can't say whether it's the real gene or the pseudogene". I suppose their reasoning is that a depth discrepancy implies reads were mismapped, i.e. variants were called in the wrong copy.
Long-read sequencing would solve everything of course, but we're talking about ~ 30,000 people, and the cost of long-read sequencing is just ridiculously beyond our budget...
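For what it's worth, the depth comparison from the paper mentioned above could be sketched roughly like this. This is not the paper's actual method: the function name and the `max_fold` threshold of 1.5 are assumptions picked for illustration, and the depths would have to come from your own BAM files at the variant position and its homologous position in the other copy.

```python
def depth_is_discordant(gene_depth, pseudo_depth, max_fold=1.5):
    """Flag a variant as ambiguous when the two depths differ by more than max_fold.

    max_fold is a hypothetical cutoff, not a published value.
    """
    if gene_depth < 0 or pseudo_depth < 0:
        raise ValueError("depths must be non-negative")
    low, high = sorted((gene_depth, pseudo_depth))
    if low == 0:
        # One copy has no coverage: discordant unless both are zero.
        return high > 0
    return high / low > max_fold


# Invented depths for illustration only.
similar = depth_is_discordant(100, 95)
skewed = depth_is_discordant(150, 60)
```

A variant flagged this way would be treated as "can't tell which copy it belongs to" rather than as a confirmed GBA1 call.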