Convert specific regions in bam file to vcf and annotating the vcf
1
1
Entering edit mode
8.4 years ago
komal.rathi ★ 4.1k

Hello,

I have a couple of bam files from exome sequencing and a list of regions that I want to perform variant calling and variant annotation on. My goal is to look for evidence of mutations in those regions that go beyond canonical LOF protein coding mutations.

I am thinking of extracting the regions of interest from a bam, converting to a vcf and then annotating it using snpEff or something like that. The following is my commandline:

# add read groups 
picard AddOrReplaceReadGroups I=sample.bam O=sample.fixed.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20

# run gatk 
java GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta
-I sample.fixed.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o sample.gvcf

# run snpeff 
snpeff -c snpEff.GRCh37.config -ud 10 -classic GRCh37.75 sample.gvcf > sample.snpeff.gvcf

Q1. Is my approach correct? Is this approach applicable to WGS and RNASeq as well?

My vcf file generated in this manner has missing ALT info - it says <non_ref>. How can I fix this?

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  20
chr4    54243811        .       C       <NON_REF>       .       .       END=54243904    GT:DP:GQ:MIN_DP:PL      0/0:208:99:129:0,120,1800
chr4    54243905        .       T       C,<NON_REF>     10162.77        .       DP=280;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.94       GT:AD:DP:GQ:P
L:SB       1/1:0,268,0:268:99:10191,805,0,10191,805,10191:0,0,160,108
chr4    54243906        .       C       <NON_REF>       .       .       END=54244236    GT:DP:GQ:MIN_DP:PL      0/0:170:99:37:0,99,1485
chr4    54244237        .       T       <NON_REF>       .       .       END=54244238    GT:DP:GQ:MIN_DP:PL      0/0:36:90:35:0,90,1350
chr4    54244239        .       C       <NON_REF>       .       .       END=54244241    GT:DP:GQ:MIN_DP:PL      0/0:34:72:34:0,72,1080
chr4    54244242        .       G       <NON_REF>       .       .       END=54244244    GT:DP:GQ:MIN_DP:PL      0/0:33:63:33:0,63,945
chr4    54244245        .       C       <NON_REF>       .       .       END=54244247    GT:DP:GQ:MIN_DP:PL      0/0:41:78:39:0,78,1170

Q2. Also, can someone suggest a better approach for creating a vcf for just specific regions of bam file?

exome sequencing bam vcf subset • 2.9k views
ADD COMMENT
0
Entering edit mode

Why don't you call all SNPs and extract SNPs overlapping regions of interest using intersectBed ?

ADD REPLY
0
Entering edit mode

I tried the method below (adding -L parameter to GATK) and it is really fast compared to calling all the SNPs and then intersecting.

ADD REPLY
3
Entering edit mode
8.4 years ago

You can use the -L flag of HaplotypeCaller to limit the intervals in which variants are detected as specified by a bed file.

--intervals / -L One or more genomic intervals over which to operate Use this option to perform the analysis over only part of the genome. This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -L chr1 or -L chr1:100-200) or by loading in a file containing a list of intervals (e.g. -L myFile.intervals). Additionally, you can also specify a ROD file (such as a VCF file) in order to perform the analysis at specific positions based on the records present in the file (e.g. -L file.vcf). Finally, you can also use this to perform the analysis on the reads that are completely unmapped in the BAM file (i.e. those without a reference contig) by specifying -L unmapped.

from https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_engine_CommandLineGATK.php

ADD COMMENT
0
Entering edit mode

Thank you! That worked and is quite fast because I was only looking at a small number of genes.

ADD REPLY
0
Entering edit mode

This link doesn't work

ADD REPLY

Login before adding your answer.

Traffic: 2227 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6