Called variants in VCF file map to untargeted regions. Does this make sense?
9.5 years ago
illinois.ks ▴ 210

I used the Illumina MiSeq TruSight One sequencing panel, which covers about 5000 genes, to generate the NGS data, so those 5000 genes are the target regions.

Starting from the FASTQ files, I followed the GATK pipeline and used SnpEff to call and annotate variants.

However, when I mapped the called variants to genes, I found that of the 323 genes with variants, 100 are not in our target regions.

I understand that this can happen through experimental or analysis error:

  1. Non-targeted regions can be captured, so some reads could genuinely come from there.
  2. Reads can be mismapped during alignment.

However, 100 out of 323 genes seems like a lot, doesn't it? Could somebody please comment on this?

(BTW, one other suspect is the mapping step: I used TopHat against the whole genome with the following command:

tophat -p 4 -G genes.gtf -o xxx_thout --no-novel-juncs genome xxxR1.fq xxxR2.fq

Do I have to force the reads to map only to our targeted regions (e.g. with a BED file), or is this okay?)

Then, starting from the accepted.bam file, I followed the GATK pipeline. Is that right?

Thanks in advance

vcf tophat gatk targed-sequencing • 2.6k views
9.5 years ago

Why are you using TopHat to align reads for variant calling? It's not very robust against variations from the reference, and it is designed for spliced alignment of RNA-seq reads.

But there's nothing wrong with calling variants outside your targeted regions, as long as the depth is sufficient for confidence. Never try to force reads to map to the target only or you will greatly increase the false variant rate.
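
One way to act on this without restricting the alignment is to map against the whole genome, call variants everywhere, and only afterwards label each call as on-target or off-target. Below is a minimal Python sketch of that labeling step. The function names, the example intervals, and the `min_depth=10` cutoff are all made up for illustration; it assumes BED-style 0-based, half-open intervals and 1-based VCF positions.

```python
from bisect import bisect_right

def build_index(bed_rows):
    """Group BED intervals (0-based, half-open) by chromosome and merge overlaps."""
    by_chrom = {}
    for chrom, start, end in bed_rows:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or touching interval: extend the previous one.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index

def on_target(index, chrom, pos):
    """Return True if a 1-based VCF position lies inside a target interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    zero_based = pos - 1
    i = bisect_right(starts, zero_based) - 1  # last interval starting at or before pos
    return i >= 0 and zero_based < ends[i]

def label_variants(variants, index, min_depth):
    """Drop calls below min_depth; label the rest as on- or off-target."""
    return [
        (chrom, pos, depth,
         "on_target" if on_target(index, chrom, pos) else "off_target")
        for chrom, pos, depth in variants
        if depth >= min_depth
    ]

# Hypothetical example data: (chrom, start, end) targets and (chrom, pos, depth) calls.
targets = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
variants = [("chr1", 101, 35), ("chr1", 301, 40), ("chr2", 60, 8), ("chr3", 10, 50)]

index = build_index(targets)
labeled = label_variants(variants, index, min_depth=10)
for row in labeled:
    print(row)
```

Off-target calls are kept and labeled rather than discarded, so the decision to trust them can rest on depth and quality instead of on location alone.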


Thank you so much Brian,

I see, so it is better to use another aligner such as bwa or bowtie2? I thought that since TopHat uses bowtie2 as its internal engine, it would be okay. Next time I will keep in mind to use a different aligner for this.

Anyhow, as you mentioned, although one third of my called variants are in untargeted regions, that is okay as long as my read depth is sufficient? (I used a threshold of 20; is that okay?)

In that case, can I also trust the variants in untargeted regions, or do I have to throw them away?

I am trying to find disease-related variants, so I want to be somewhat conservative.

Thanks


There is no reason not to trust variants in non-targeted regions. And 20 is certainly quite high for a minimum depth; I'd generally go lower.

"Conservative" is kind of a loaded term. Arbitrarily excluding potentially causal variants is not, typically, what I would think of as conservative. Rather, conservative would mean excluding only the variants that you can say with high confidence are either not real or not causal. I suggest ranking detected variations by both probability of being real and probability of being harmful, and focusing on the top of the list.

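The ranking idea can be sketched roughly as follows. This is only an illustration: the `Candidate` fields, the variant names, the scores, and the choice to multiply the two probabilities are all assumptions, not a recommended scoring scheme. The one standard piece is the Phred conversion, where a Phred-scaled quality Q corresponds to an error probability of 10^(-Q/10).

```python
from dataclasses import dataclass

def phred_to_prob_real(qual):
    """Convert a Phred-scaled quality to the probability that the call is correct."""
    return 1.0 - 10 ** (-qual / 10.0)

@dataclass
class Candidate:
    name: str
    p_real: float     # assumed estimate that the call is a true variant (0..1)
    p_harmful: float  # assumed estimate that the variant is damaging (0..1)

def rank_candidates(candidates):
    """Sort by the product of the two probabilities, highest first."""
    return sorted(candidates, key=lambda c: c.p_real * c.p_harmful, reverse=True)

# Hypothetical candidates with made-up scores.
candidates = [
    Candidate("var_A", p_real=phred_to_prob_real(20), p_harmful=0.10),
    Candidate("var_B", p_real=0.80, p_harmful=0.90),
    Candidate("var_C", p_real=0.30, p_harmful=0.95),
]

ranked = rank_candidates(candidates)
top_names = [c.name for c in ranked]
print(top_names)
```

Nothing is excluded here; low-scoring variants simply sink to the bottom of the list, which matches the point about not discarding potentially causal variants arbitrarily.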

BTW, do you think I need to re-align my FASTQ files with bwa or bowtie2 instead of TopHat?

Or is it okay to use my TopHat-aligned BAM file for variant calling as it is?


You definitely need to remap with something else.

Personally, I would recommend BBMap, but I am biased in this case. Regardless, TopHat is not the right choice.

9.5 years ago
mhockin ▴ 610

Targeted or not, if you're looking at exons and your mapped read depth is enough to call variants accurately, I would not hesitate to call variants with confidence. As I understand it, modern mapping software is accurate enough that if you have a single genome in your sequencing template and you're mapping reads across the entire genome, a very large fraction of coding regions will be uniquely mappable. That is to say, any read mapping to an exome sequence is highly likely to be mapped correctly, so your ability to call SNPs on this fraction of reads is probably high.

That said, I don't immediately have a literature citation to back this up; perhaps others will chime in with better-formulated answers.


Thank you so much for your comments. It really helps!!
