Too many variant calls from GATK pipeline using joint calling
1
1
Entering edit mode
6.9 years ago
Argossy ▴ 50

I have exome sequencing with 69 human samples and used the GATK pipeline with joint calling. But got too many variant calls after the joint calling (~2 million calls per subject) before filtering. even if after the filtering using VQSR and GQ > 20, I still have ~500 k of variant calls. anything wrong in my pipeline?

below is my pipeline:

sequencing snp • 2.1k views
ADD COMMENT
0
Entering edit mode
id=$id_input
dirin=$dir_input

bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L002_R1_001.fastq.gz $dirin/$rawName_L002_R2_001.fastq.gz > $id_L002_001.sam
bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L002_R1_002.fastq.gz $dirin/$rawName_L002_R2_002.fastq.gz > $id_L002_002.sam
bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L001_R1_001.fastq.gz $dirin/$rawName_L001_R2_001.fastq.gz > $id_L001_001.sam
bwa mem -M -t 16 ./ucsc.hg19.fasta  $dirin/$rawName_L001_R1_002.fastq.gz $dirin/$rawName_L001_R2_002.fastq.gz > $id_L001_002.sam

# 3. Convert .sam to .bam and sort
line=$id_L002_001
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

line=$id_L002_002
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

line=$id_L001_001
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

line=$id_L001_002
samtools view -bS $line.sam > $line.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$line.bam OUTPUT=$line.sorted.bam SORT_ORDER=coordinate

# 4. Add read group
line=$id_L002_001
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

line=$id_L002_002
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

line=$id_L001_001
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

line=$id_L001_002
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar AddOrReplaceReadGroups INPUT=$line.sorted.bam OUTPUT=$line.sorted.rg.bam RGID=$line RGLB=WGE_$id RGPL=ILLUMINA RGPU=machine RGSM=$id
samtools index $line.sorted.rg.bam

# 5. Merge two different lanes
id1=$id_L002_001
id2=$id_L002_002
id3=$id_L001_001
id4=$id_L001_002
#id=0199-1-B
samtools merge $id.merge.bam $id1.sorted.rg.bam $id2.sorted.rg.bam $id3.sorted.rg.bam $id4.sorted.rg.bam
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar SortSam INPUT=$id.merge.bam OUTPUT=$id.merge.sorted.bam SORT_ORDER=coordinate
samtools index $id.merge.sorted.bam

# 6. Dedup
#id=0199-1-B
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar MarkDuplicates INPUT=$id.merge.sorted.bam OUTPUT=$id.merge.sorted.dedup.bam  METRICS_FILE=$id.merge.sorted.dedup.metrics.txt PROGRAM_RECORD_ID= MarkDuplicates PROGRAM_GROUP_VERSION=null PROGRAM_GROUP_NAME=MarkDuplicates
java -jar /nas/longleaf/apps/picard/2.10.3/picard-2.10.3/picard.jar BuildBamIndex INPUT=$id.merge.sorted.dedup.bam

## remove the temp files
rm -rf $id.merge.sorted.bam
rm -rf $id.merge.sorted.bam.bai

#7. Realign
#id=0199-1-B
gatk -T RealignerTargetCreator -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.bam -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o $id.merge.sorted.dedup.target_intervals.list
gatk -T IndelRealigner -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.bam -targetIntervals $id.merge.sorted.dedup.target_intervals.list -known Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -known 1000G_phase1.indels.hg19.sites.vcf -o $id.merge.sorted.dedup.realigned.bam

# 8. Recalibrate
#id=0199-1-B
gatk -T BaseRecalibrator -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.realigned.bam -knownSites dbsnp_138.hg19.vcf -knownSites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -knownSites 1000G_phase1.indels.hg19.sites.vcf -o $id.merge.sorted.dedup.realigned.recal_data.table

gatk -T PrintReads -R ucsc.hg19.fasta -I $id.merge.sorted.dedup.realigned.bam -BQSR $id.merge.sorted.dedup.realigned.recal_data.table -o $id.merge.sorted.dedup.realigned.recal.bam
samtools index $id.merge.sorted.dedup.realigned.recal.bam

## 9. GATK HaplotypeCaller
#id=0199-1-B
gatk -R ucsc.hg19.fasta -T HaplotypeCaller -I $id.merge.sorted.dedup.realigned.recal.bam --dbsnp dbsnp_138.hg19.vcf --emitRefConfidence GVCF -o raw_variant.$id.g.vcf
ADD REPLY
0
Entering edit mode
6.9 years ago
 --emitRefConfidence GVCF

You're creating a GVCF, that is not a VCF (https://gatkforums.broadinstitute.org/gatk/discussion/4017/), you need to call GenotypeGVFs

ADD COMMENT
0
Entering edit mode

I actually perform joint calling:

gatk -T GenotypeGVCFs -R ucsc.hg19.fasta -V raw_variant.0107-2-1.g.vcf -V raw_variant.0107-2-A.g.vcf -V raw_variant.0107-2-B.g.vcf ...

and then

##### Recalibrate variant quality scores = run VQSR

### 1. Prepare recalibration parameters for SNPs

### 2. Build the SNP recalibration model

gatk -T VariantRecalibrator -R ucsc.hg19.fasta -input variant_raw_joint.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.sites.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.sites.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -rscriptFile recalibrate_SNP_plots.R

### 3. Apply the desired level of recalibration to the SNPs in the call set

gatk -T ApplyRecalibration -R ucsc.hg19.fasta -input variant_raw_joint.vcf -mode SNP --ts_filter_level 99.5 -recalFile recalibrate_SNP.recal -tranchesFile recalibrate_SNP.tranches -o recalibrated_snps_raw_indels.vcf

### 4. Prepare recalibration parameters for Indels

### 5. Build the Indel recalibration model

gatk -T VariantRecalibrator -R ucsc.hg19.fasta -input recalibrated_snps_raw_indels.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf -an QD -an FS -an SOR -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -rscriptFile recalibrate_INDEL_plots.R

#### 6. Apply the desired level of recalibration to the Indels in the call set

gatk -T ApplyRecalibration -R ucsc.hg19.fasta -input recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile recalibrate_INDEL.recal -tranchesFile recalibrate_INDEL.tranches -o recalibrated_variants.vcf

ADD REPLY

Login before adding your answer.

Traffic: 1996 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6