What kind of filtering can be applied to reduce my exonic variants?
7.7 years ago

Hi Folks,

I am carrying out exome-seq analysis for a trio (the son and the father are affected; the mother is unaffected). I did the following steps:

  • Trimmed using Trimmomatic

  • Aligned using BWA mem, Sorted, Marked Duplicates and Recalibrated.

  • For all 3 samples the coverage for target bases at 20X - 92%, at 30X - 86%, 40X - 80%, 50X - 72% (Is this coverage good enough for exome sequencing?)

  • Variant calling using Unified Genotyper - GATK (Is Unified Genotyper good for trio analysis?)

  • Got raw variants (900,000), Filtered variants, Annotated with SnpEff

  • Selected PASS variants (750,000), Selected Exonic Variants (12,000)

  • Then I checked in disease-related genes (Got 15 variants)

  • Finally, I selected the variants shared by the affected son and the affected father (3,000 heterozygous and 2,800 homozygous-alternate variants)

  • I am planning to check these against dbSNP.

  • What are the other general ways to reduce my variants?
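The selection of variants shared by the affected son and the affected father can be sketched in Python as follows. This is only an illustration: the record layout (`son`, `father`, `mother` keys holding VCF-style GT strings) and the example calls are made up, and the optional exclusion of variants also carried by the unaffected mother is an assumption about the inheritance model, not something stated above.

```python
import re

def carries_alt(genotype):
    """True if a VCF GT string such as '0/1' or '1|1' has a non-reference allele."""
    alleles = re.split(r"[/|]", genotype)
    return any(a not in ("0", ".") for a in alleles)

def shared_by_affected(variants, exclude_if_in_mother=False):
    """Keep variants carried by both the son and the father.

    exclude_if_in_mother is a hypothetical extra step: it also drops variants
    carried by the unaffected mother.
    """
    kept = []
    for v in variants:
        if not (carries_alt(v["son"]) and carries_alt(v["father"])):
            continue
        if exclude_if_in_mother and carries_alt(v["mother"]):
            continue
        kept.append(v)
    return kept

# Made-up genotype calls, for illustration only.
trio_calls = [
    {"id": "varA", "son": "0/1", "father": "0/1", "mother": "0/0"},
    {"id": "varB", "son": "1/1", "father": "0/1", "mother": "0/1"},
    {"id": "varC", "son": "0/1", "father": "0/0", "mother": "0/1"},
    {"id": "varD", "son": "./.", "father": "0/1", "mother": "0/0"},
]

print([v["id"] for v in shared_by_affected(trio_calls)])
print([v["id"] for v in shared_by_affected(trio_calls, exclude_if_in_mother=True)])
```

A missing genotype (`./.`) is treated here as "not carrying the variant", which may be too strict for low-coverage positions.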

Exomeseq SNP DNAseq Trio Variants • 2.3k views
7.7 years ago

For all 3 samples the coverage for target bases at 20X - 92%, at 30X - 86%, 40X - 80%, 50X - 72% (Is this coverage good enough for exome sequencing?)

Those numbers look normal, but since they are averages, some regions may still have low coverage (this is unavoidable).

Variant calling using Unified Genotyper - GATK

Why not HaplotypeCaller?

What are the other general ways to reduce my variants?

We don't know which disease you are working on, but ExAC is probably a good resource to eliminate frequent variants. If you are looking for a highly penetrant mutation for a rare disease you can exclude all variants that are frequent (well above disease prevalence).
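A minimal sketch of such a frequency filter in Python. The record layout (an `af` key holding a population allele frequency, `None` when the variant is absent from the database) and the 1% default cutoff are assumptions chosen for illustration; the appropriate cutoff depends on the disease and its prevalence.

```python
def filter_rare(variants, max_af=0.01):
    """Keep variants whose population allele frequency is unknown or <= max_af."""
    kept = []
    for v in variants:
        af = v.get("af")  # None: variant not present in the population database
        if af is None or af <= max_af:
            kept.append(v)
    return kept

# Made-up records, for illustration only.
example = [
    {"id": "var1", "af": 0.25},
    {"id": "var2", "af": 0.0004},
    {"id": "var3", "af": None},
]

rare = filter_rare(example)
print([v["id"] for v in rare])
```

Variants absent from the database are kept here on the reasoning that a variant nobody has reported is at least not known to be frequent.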


Thanks, WouterDeCoster.

Yeah, I am also going to run HaplotypeCaller (with GVCFs) individually on the 3 samples and then run joint genotyping. I thought of taking the union of calls from Unified genotyper and HaplotypeCaller. I thought those calls might be true positives. Does it make sense or not?

I am working on hypercholesterolemia. Sure, I will compare with ExAC. Can the ExAC database be used to filter out variants with MAF > 1% as frequent variants?


I thought of taking the union of calls from Unified genotyper and HaplotypeCaller. I thought those calls might be true positives. Does it make sense or not?

Don't you mean the intersection rather than the union, to increase true positives?
Anyway, HaplotypeCaller is supposed to be superior, so I'm not sure combining both would add value. You could even lose variants that were found by HaplotypeCaller but missed by UnifiedGenotyper.
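To make the difference concrete, here is a small Python sketch using sets of variant keys. The `(chrom, pos, ref, alt)` key and the example calls are illustrative assumptions; real call sets would also need consistent normalization of indels before being compared this way.

```python
def variant_keys(records):
    """Reduce call records to a set of (chrom, pos, ref, alt) keys."""
    return {(r["chrom"], r["pos"], r["ref"], r["alt"]) for r in records}

# Made-up calls from the two callers, for illustration only.
ug_calls = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 200, "ref": "C", "alt": "T"},
]
hc_calls = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr3", "pos": 300, "ref": "G", "alt": "GA"},
]

ug = variant_keys(ug_calls)
hc = variant_keys(hc_calls)

intersection = ug & hc  # called by both: fewer calls, likely fewer false positives
union = ug | hc         # called by either: more calls, likely more false positives
hc_only = hc - ug       # what the intersection would throw away from HaplotypeCaller

print(len(intersection), len(union), len(hc_only))
```

The `hc_only` set is the cost of intersecting: calls that only HaplotypeCaller made are lost.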

I am working on hypercholesterolemia.

As far as I know that's not exactly a rare condition, right?

Sure, I will compare with ExAC. Can the ExAC database be used to filter out variants with MAF > 1% as frequent variants?

You can download the summary data from ExAC and use the VCF to annotate your variants, then filter out the frequent ones.
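As a rough sketch of that annotation step in Python: build a lookup from a sites-only VCF and attach a frequency to each of your variants. It assumes the population VCF carries a standard `AF` entry in its INFO column (one value per ALT allele); the example lines are made up and are not real ExAC records. In practice a dedicated annotation tool would normally do this, and chromosome naming (`1` versus `chr1`) has to match between the two files.

```python
def load_af_lookup(vcf_lines):
    """Map (chrom, pos, ref, alt) -> allele frequency from sites-only VCF lines."""
    lookup = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts, info = fields[0], fields[1], fields[3], fields[4], fields[7]
        info_fields = dict(
            item.split("=", 1) for item in info.split(";") if "=" in item
        )
        if "AF" not in info_fields:
            continue
        # AF holds one value per ALT allele, in the same order as ALT.
        for alt, af in zip(alts.split(","), info_fields["AF"].split(",")):
            lookup[(chrom, int(pos), ref, alt)] = float(af)
    return lookup

# Made-up population VCF lines, for illustration only.
population_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t.\tPASS\tAC=50;AN=1000;AF=0.05",
    "2\t2000\t.\tC\tT,G\t.\tPASS\tAC=1,300;AN=1000;AF=0.001,0.3",
]
af_lookup = load_af_lookup(population_vcf)

# Hypothetical variants from your own call set; None means "not in the database".
my_variants = [("1", 1000, "A", "G"), ("2", 2000, "C", "T"), ("3", 3000, "G", "A")]
annotated = [(v, af_lookup.get(v)) for v in my_variants]
for variant, af in annotated:
    print(variant, af)
```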

7.7 years ago

Sorry, I meant intersection, not union. HaplotypeCaller is recommended by the GATK team. Last week, I ran my 3 samples individually using HaplotypeCaller with the GVCF option. It failed 6 times due to memory issues, which is why I first tried UnifiedGenotyper. I have now used another high-performance computing machine to generate calls with HaplotypeCaller in GVCF mode. I got the following numbers of calls: UnifiedGenotyper, 930,000 VCF records; HaplotypeCaller, 1.4 million VCF records.

1) Why am I getting more calls from HaplotypeCaller in GVCF mode? Or is that normal to expect?

2) Also, I noticed that 200,000 VCF records fall on contigs such as these (chr1_gl000191_random,...chr4_ctg9_hap1,chrUn_gl0000210..etc). What about the variants on these unknown contigs in the hg19 reference genome?


3) As per the GATK Best Practices, I cannot use VQSR for filtering because I have only 3 samples, and they say at least 25 exome-sequenced samples are needed for VQSR. They suggest either hard filtering or calling variants together with additional exome BAM files from the 1000 Genomes Project. I applied hard filtering to the raw VCF. Which is the best option for trio analysis?

Hypercholesterolemia is a familial inherited disease.

Currently, I am annotating my calls with ExAC (release 0.3.1 from the Broad FTP site), ClinVar (vcf_GRCh37 from the NCBI ClinVar FTP site), and dbNSFP, using SnpEff.
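Regarding point 2, one way to at least see how the records split is to separate the primary chromosomes from everything else. The Python sketch below assumes UCSC-style hg19 names (`chr1` to `chr22`, `chrX`, `chrY`, `chrM`) for the primary set; whether the records on the other contigs should then be set aside is a judgment call that this sketch does not settle. The example lines are made up.

```python
# Assumed primary contig names for a UCSC-style hg19 reference.
PRIMARY_CONTIGS = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY", "chrM"}

def split_by_contig(vcf_lines):
    """Split VCF lines into header lines, primary-contig records, and other records."""
    headers, primary, other = [], [], []
    for line in vcf_lines:
        if line.startswith("#"):
            headers.append(line)
        elif line.split("\t", 1)[0] in PRIMARY_CONTIGS:
            primary.append(line)
        else:
            other.append(line)
    return headers, primary, other

# Made-up VCF lines, for illustration only.
example_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr1_gl000191_random\t500\t.\tC\tT\t30\tPASS\t.",
    "chr4_ctg9_hap1\t700\t.\tG\tA\t40\tPASS\t.",
    "chrX\t900\t.\tT\tC\t60\tPASS\t.",
]

headers, primary, other = split_by_contig(example_vcf)
print(len(primary), "records on primary chromosomes;", len(other), "on other contigs")
```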

7.7 years ago

Is this for research purposes, clinical use, or something else?

You can also remove synonymous variants and add variants in splice sites (I prefer RefSeq transcripts, which ACMG also recommends). You may find more recommendations in the ACMG guidelines. In your case, the list of genes you are focusing on could perhaps be extended with the help of OMIM and HPO.
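A rough Python sketch of the synonymous/splice-site rule, assuming the variants were annotated by SnpEff with an `ANN` INFO value whose second pipe-separated field holds Sequence Ontology effect terms joined by `&`. The specific terms listed and the example annotation strings (with placeholder gene names) are illustrative assumptions; older SnpEff output using the `EFF` field has a different layout.

```python
SYNONYMOUS = {"synonymous_variant"}
SPLICE = {
    "splice_donor_variant",
    "splice_acceptor_variant",
    "splice_region_variant",
}

def effects_from_ann(ann_value):
    """Collect effect terms from an ANN value (comma-separated annotations)."""
    effects = set()
    for entry in ann_value.split(","):
        fields = entry.split("|")
        if len(fields) > 1 and fields[1]:
            effects.update(fields[1].split("&"))
    return effects

def keep_variant(ann_value):
    """Drop a variant only when every annotated effect is synonymous."""
    effects = effects_from_ann(ann_value)
    if not effects:
        return True  # unannotated: keep for manual review
    if effects & SPLICE:
        return True  # splice-site effects are always retained
    return bool(effects - SYNONYMOUS)

# Made-up, truncated annotation strings with placeholder gene names.
examples = [
    "A|synonymous_variant|LOW|GENE1",
    "A|splice_region_variant&synonymous_variant|LOW|GENE1",
    "T|missense_variant|MODERATE|GENE2",
]
for ann in examples:
    print(ann, "->", "keep" if keep_variant(ann) else "drop")
```

A variant that is synonymous in one transcript but falls in a splice region of another is kept, which matches the intent of retaining splice-site variants.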

At ALAPY we developed ALAPY Genome Explorer to analyze VCF files and trios and to filter variants based on different data sources such as ExAC. This was done to test our database design, but you can already use it for your study, and it is free at the moment: http://alapy.com/services/alapy-genome-explorer/ Register now to get free access. Also, since you generated fastq files: we finished the first version of a fastq compression program called ALAPY Compressor http://alapy.com/services/alapy-compressor/ It is also free and will reduce your fastq.gz (gzipped fastq) file sizes by 1.5 to 3 times. The compression is lossless.

We would love to hear back from you about ALAPY Compressor and ALAPY Genome Explorer.

Thank you


Hi Petr Ponomarenko,

This is a clinical as well as research study. Sure, I will try ALAPY and give my feedback on it.


Please feel free to ask questions here and on the website, especially if you want us to add/remove sources of data or functionality. Thank you. Petr
