I used the Illumina MiSeq TruSight One sequencing panel to generate the NGS data. The panel covers about 5000 genes, so those 5000 genes are the target region.
Starting from the FASTQ files, I followed the GATK pipeline and used SnpEff to call and annotate variants.
However, when I mapped the called variants to genes, I found that 100 of the 323 genes with variants are not in our target regions.
I understand that this could happen because of experimental or analysis error:
- non-targeted regions could have been captured, so reads could come from there
- mapping errors could place reads in the wrong location
However, 100 out of 323 genes seems like a lot, doesn't it? Could somebody please comment on this?
(BTW, one other suspect is the mapping step: I used TopHat against the whole genome, with the following command:
tophat -p 4 -G genes.gtf -o xxx_thout --no-novel-juncs genome xxxR1.fq xxxR2.fq
Do I have to restrict the mapping to our targeted regions (e.g. with a BED file), or is this okay?
Then I ran the GATK pipeline on the accepted.bam file. Is that right?)
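As a rough way to quantify how many calls fall outside the panel, variant positions can be checked against the target intervals from a BED file. The following is only a minimal sketch and not part of the GATK pipeline: the names `load_targets` and `is_on_target`, the `padding` parameter, and the example coordinates are all made up for illustration. It assumes the usual BED layout (0-based, half-open intervals) and 1-based VCF positions.

```python
from bisect import bisect_right
from collections import defaultdict


def load_targets(bed_lines):
    """Build sorted, merged target intervals per chromosome from BED lines."""
    raw = defaultdict(list)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def is_on_target(targets, chrom, pos, padding=0):
    """True if 1-based position `pos` lies in a target interval (+/- padding)."""
    if chrom not in targets:
        return False
    starts, ends = targets[chrom]
    pos0 = pos - 1  # VCF is 1-based, BED is 0-based
    # Last interval whose (padded) start is at or before the position.
    i = bisect_right(starts, pos0 + padding) - 1
    return i >= 0 and pos0 - padding < ends[i]


# Example with made-up coordinates (not real TruSight One targets).
targets = load_targets(["chr1\t100\t200", "chr1\t150\t300", "chr2\t10\t20"])
calls = [("chr1", 101), ("chr1", 301), ("chr3", 5)]
off_target = [c for c in calls if not is_on_target(targets, *c)]
print(f"{len(off_target)} of {len(calls)} calls are off target")
```

Running this over all called variants gives the off-target fraction directly. A small nonzero `padding` may be worth trying, since capture usually pulls in some sequence flanking the targets, so calls just outside an interval are not necessarily suspicious.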
Thanks in advance
Thank you so much Brian,
I see, so it is better to use another aligner such as bwa or bowtie2? I thought that since TopHat also uses bowtie2 as its internal engine, it would be okay. I will keep in mind to use a different aligner for this next time.
Anyhow, as you mentioned, although one third of my called variants are in untargeted regions, that is okay as long as my read depth is sufficient? (I used a threshold of 20; is that okay?)
In this case, can I also trust the variants in untargeted regions, or do I have to throw them away?
I am trying to find disease-related variants, so I want to be somewhat conservative.
Thanks
There is no reason not to trust variants in non-targeted regions. And 20 is certainly quite high for a minimum depth; I'd generally go lower.
"Conservative" is kind of a loaded term. Arbitrarily excluding potentially causal variants is not, typically, what I would think of as conservative; rather, conservative would mean excluding only the variants that you can say with high confidence are either not real or not causal. I suggest ranking detected variants by both probability of being real and probability of being harmful, and focusing on the top of the list.
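One way to read the ranking suggestion, as a rough sketch only: give each variant two scores between 0 and 1 and sort by their product. The `Variant` class, the `p_real` and `p_harmful` fields, the example values, and the choice of a simple product are all assumptions made for illustration; in practice the scores would have to be derived from things like call quality and functional annotation.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    p_real: float     # hypothetical: estimated probability the call is real
    p_harmful: float  # hypothetical: estimated probability it is damaging


def rank_variants(variants):
    """Sort variants by p_real * p_harmful, highest combined score first."""
    return sorted(variants, key=lambda v: v.p_real * v.p_harmful, reverse=True)


# Made-up example values, not real calls.
candidates = [
    Variant("var_a", p_real=0.99, p_harmful=0.10),
    Variant("var_b", p_real=0.90, p_harmful=0.80),
    Variant("var_c", p_real=0.50, p_harmful=0.90),
]
for v in rank_variants(candidates):
    print(v.name, round(v.p_real * v.p_harmful, 3))
```

Nothing is thrown away here: a well-supported but probably benign variant simply sinks toward the bottom of the list instead of being excluded by a hard cutoff.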
BTW, do you think I need to re-align my FASTQ files using bwa or bowtie2 instead of TopHat?
Or is it okay to use the BAM file aligned by TopHat for variant calling as it is?
You definitely need to remap with something else.
Personally, I would recommend BBMap, but I am biased in this case. Regardless, TopHat is not the right choice.