Hello,
I am new to exome sequencing data analysis and would like advice on how to proceed in my situation. I have spent quite some time trying to figure this out on my own, but with no one around to guide me I have not made much progress. I have been given VCF files, suspected to be from GATK UnifiedGenotyper, for case and control samples (A1 and A2 are our cases and B1 is our control), namely case[or control].indel.raw.vcf, case[or control].snp.raw.vcf, case[or control].var.raw.vcf. Now I need to 1) identify rare variants (SNPs or indels with a frequency of less than 0.01% in ExAC or gnomAD) that are present only in the two cases and not in the control, and 2) obtain PolyPhen/SIFT or other scores for the identified rare variants.
My questions are: 1) The GATK manual says that, since UnifiedGenotyper can produce many false positives, these files need to go through several filtering steps. However, I can't find in the manual which filters I need to apply or which tools to use. 2) Could you please suggest a pipeline for identifying the rare variants and obtaining PolyPhen/SIFT scores from the VCF files?
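To make goal 1 concrete, the selection I have in mind can be sketched in plain Python as below. This is only a rough sketch under several assumptions, not a real pipeline: the function names and the `population_af` lookup (variant key to ExAC/gnomAD allele frequency) are hypothetical, each VCF is assumed to list only sites where that sample carries the variant, variants missing from the lookup are treated as frequency 0, and genotype quality, coverage, and indel normalization are ignored.

```python
RARE_AF_THRESHOLD = 0.0001  # 0.01% expressed as a fraction


def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from tab-separated VCF lines."""
    keys = set()
    for line in vcf_lines:
        # Skip blank lines and header lines (## metadata and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic records list several ALT alleles separated by commas.
        for alt in alts.split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def case_only_rare(case_a1, case_a2, control_b1, population_af):
    """Variants seen in both cases, absent from the control, and rare."""
    shared = variant_keys(case_a1) & variant_keys(case_a2)
    case_only = shared - variant_keys(control_b1)
    # Assumption: a variant missing from the lookup has frequency 0 (rare).
    return sorted(
        key for key in case_only
        if population_af.get(key, 0.0) < RARE_AF_THRESHOLD
    )


# Toy records, made up purely for illustration.
a1 = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t60\tPASS\t.",
    "2\t300\t.\tG\tGA\t70\tPASS\t.",
]
a2 = list(a1)
b1 = ["1\t200\t.\tC\tT\t55\tPASS\t."]
population_af = {
    ("1", 100, "A", "G"): 0.00005,
    ("2", 300, "G", "GA"): 0.02,
}

result = case_only_rare(a1, a2, b1, population_af)
print(result)
```

In this toy data the second variant is dropped because the control also carries it, and the third because its population frequency is above the threshold. What I do not know is how to do the filtering and annotation steps properly on real data, which is what my questions above are about.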
Thank you very much for your time.