Hello,
I have a couple of bam files from exome sequencing and a list of regions that I want to perform variant calling and variant annotation on. My goal is to look for evidence of mutations in those regions that go beyond canonical LOF protein coding mutations.
I am thinking of extracting the regions of interest from a bam, converting to a vcf and then annotating it using snpEff or something like that. The following is my commandline:
# add read groups
picard AddOrReplaceReadGroups I=sample.bam O=sample.fixed.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20
# run gatk
java GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta
-I sample.fixed.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample.gvcf
# run snpeff
snpeff -c snpEff.GRCh37.config -ud 10 -classic GRCh37.75 sample.gvcf > sample.snpeff.gvcf
Q1. Is my approach correct? Is this approach applicable to WGS and RNASeq as well?
My vcf file generated in this manner has missing ALT info - it says <non_ref>. How can I fix this?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 20
chr4 54243811 . C <NON_REF> . . END=54243904 GT:DP:GQ:MIN_DP:PL 0/0:208:99:129:0,120,1800
chr4 54243905 . T C,<NON_REF> 10162.77 . DP=280;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.94 GT:AD:DP:GQ:P
L:SB 1/1:0,268,0:268:99:10191,805,0,10191,805,10191:0,0,160,108
chr4 54243906 . C <NON_REF> . . END=54244236 GT:DP:GQ:MIN_DP:PL 0/0:170:99:37:0,99,1485
chr4 54244237 . T <NON_REF> . . END=54244238 GT:DP:GQ:MIN_DP:PL 0/0:36:90:35:0,90,1350
chr4 54244239 . C <NON_REF> . . END=54244241 GT:DP:GQ:MIN_DP:PL 0/0:34:72:34:0,72,1080
chr4 54244242 . G <NON_REF> . . END=54244244 GT:DP:GQ:MIN_DP:PL 0/0:33:63:33:0,63,945
chr4 54244245 . C <NON_REF> . . END=54244247 GT:DP:GQ:MIN_DP:PL 0/0:41:78:39:0,78,1170
Q2. Also, can someone suggest a better approach for creating a vcf for just specific regions of bam file?
Why don't you call all SNPs and extract SNPs overlapping regions of interest using
intersectBed
?I tried the method below (adding -L parameter to GATK) and it is really fast compared to calling all the SNPs and then intersecting.