Hi all, i am looking for paired end Illumina FASTA files from whole genome NGS run. I am developing some pipelines and was using a BAM file from 1000 genomes. After I run the GATK Best Practices pipeline I am generating a VCF with no data at all. Basically, i am converting BAM to FASTQ, then aligning to hg38 and then other steps as recommended. I am wondering what is so incorrect that no variants are detected?
Before I post questions about the pipeline itself, I wanted to be sure to use an input file which has been recommended by someone in this group.
Any help here is much appreciated and if there was an accompanying VCF that would be awesome.
##Align with BWA-mem.
#sudo ./algorithms/bwa/bwa mem -M references/hg38gatkbundle/Homo_sapiens_assembly38.fasta data/HG100/HG100.1.exome.fastq data/HG100/HG100.2.exome.fastq > data/HG100/HG100.aligned.reads.sam
##Sort the SAM file and it will also convert to BAM
#sudo java -jar algorithms/picard/picard.jar SortSam INPUT=data/HG100.aligned.reads.sam OUTPUT=data/HG100.sorted.reads.bam SORT_ORDER=coordinate
##Reheader the BAM. This is required for GATK.
#sudo java -Xmx8g -jar algorithms/picard/picard.jar AddOrReplaceReadGroups I=data/HG100/HG100.sorted.reads.bam O=data/HG100/HG100.sorted.reheader.reads.bam RGID=4 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=20
##Build BAM index
#sudo java -Xmx8g -jar algorithms/picard/picard.jar BuildBamIndex INPUT=data/HG100/HG100.sorted.reheader.reads.bam
##Perform local realignment
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T RealignerTargetCreator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -known references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -L references/hg38gatkbundle/wgs_calling_regions.hg38.interval_list -I data/HG100/HG100.sorted.reheader.reads.bam -o data/HG100/HG100.realignment.targets.list
##Perform local realignment of reads around indels
#sudo java -jar -Xmx8g algorithms/gatk3/gatk3.8.jar -T IndelRealigner -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG100.sorted.reheader.reads.bam -known references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -known references/hg38gatkbundle/dbsnp_138.hg38.vcf -known references/hg38gatkbundle/Homo_sapiens_assembly38.known_indels.vcf -targetIntervals data/HG100.realignment.targets.list -o data/HG100/HG100.local.realigned.bam
##Perform Base Recalibration
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T BaseRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG15/HG15.local.realigned.bam -knownSites references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG15/HG15.recal.data.table
##Run PrintReads
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T PrintReads -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -I data/HG15/HG15.local.realigned.bam -o data/HG15/HG15.output.bam --read_filter MappingQualityZero
##Run Unified Genotyper
#Use the following command if you also need to generate a gVCF file.
#sudo java -Xmx8g -jar algorithms/gatk3/gatk3.8.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta --output_mode EMIT_ALL_SITES -I data/HG100.output.bam --dbsnp references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG100.output.raw.snps.indels.g.vcf
##Use the following command to generate a VCF file
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta --output_mode EMIT_VARIANTS_ONLY -I data/HG15/HG15.output.bam --dbsnp references/hg38gatkbundle/dbsnp_138.hg38.vcf -o data/HG15/HG15.output.raw.snps.indels.vcf
#### May have to use this one
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T UnifiedGenotyper -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -A BaseQualityRankSumTest -A Coverage -A QualByDepth -A FisherStrand -A ReadPosRankSumTest --output_mode EMIT_VARIANTS_ONLY -I data/HG15/HG15.output.bam --dbsnp references/hg38gatkbundle/Homo_sapiens_assembly38.dbsnp138.vcf -o data/HG15/HG15.output.raw.combined.vcf
###
## Run Variant Recalibrator for SNP
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T VariantRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG15/HG15.output.raw.snps.indels.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 references/hg38gatkbundle/hapmap_3.3.hg38.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 references/hg38gatkbundle/1000G_omni2.5.hg38.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 references/hg38gatkbundle/1000G_phase1.snps.high_confidence.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/hg38gatkbundle/dbsnp_138.hg38.vcf -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an InbreedingCoeff -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -recalFile data/HG15/HG15.recalibrate.SNP.recal -tranchesFile data/HG15/HG15.recalibrate.SNP.tranches -rscriptFile data/HG15/HG15.recalibrate.SNP.plots.R
## Apply Variant Recalibration for SNPs
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T ApplyRecalibration -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG15/HG15.output.raw.snps.indels.vcf -mode SNP --ts_filter_level 99.0 -recalFile data/HG15/HG15.recalibrate.SNP.recal -tranchesFile data/HG15/HG15.recalibrate.SNP.tranches -o data/HG15/HG15.recalibrated.SNP.raw.indels.vcf
## Run Variant Recalibrator for INDEL
## ?? Need to check the QD problem here, for now we have eliminated the QD annotation flag.
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T VariantRecalibrator -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG100.recalibrated.SNP.raw.indels.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 references/hg38gatkbundle/Mills_and_1000G_gold_standard.indels.hg38.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 references/hg38gatkbundle/dbsnp_138.hg38.vcf -an DP -an FS -an SOR -an MQRankSum -an ReadPosRankSum -an InbreedingCoeff -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 -recalFile data/HG100.recalibrate_INDEL.recal -tranchesFile data/HG100.recalibrate_INDEL.tranches -rscriptFile data/HG100.recalibrate_INDEL_plots.R
## Apply Variant Recalibrator
#sudo java -Xmx8g -jar algorithms/gatk3/gatk.jar -T ApplyRecalibration -R references/hg38gatkbundle/Homo_sapiens_assembly38.fasta -input data/HG100.recalibrated_snps_raw_indels.vcf -mode INDEL --ts_filter_level 99.0 -recalFile data/HG100.recalibrate_INDEL.recal -tranchesFile data/HG100.recalibrate_INDEL.tranches -o data/HG100.recalibrated_variants.vcf
Not what you are looking for, but you shouldn't use
sudo
to perform tasks like this. Sudo should only be used when you absolutely have to, for "sysadmin tasks".It is highly recommended to use
HaplotypeCaller
instead ofUnifiedGenotyper
more
I am not quite sure what you were running there, but could it be that the parameters of the pipeline are for use with multiple samples, not just one?
Please be more specific what you need. Tumor-only or matched normal or normal only? There are a couple of standard datasets that include validated mutations. Better start with these, e.g. NA12878. Still, please provide some details on your pipeline. Best would be to provide the exact commands you used and the alignment statistics. No results at all is suspicious. Every WGS sample, also healthy ones, contains thousands of germline mutations. There must be a bug somewhere in your pipeline.
OK! So I am looking for germline mutation so no need for tumor vs. sample. I am posting the entire pipeline just now.
Not really answering your question but maybe useful tips... To add Read Group to your bam files, run
bwa mem
with the-R STR
options. This way you can eliminate the AddOrReplaceReadGroup step. Also, consider piping the output ofbwa mem
tosamtools sort
so you get the sorted bam file without passing by SAM. Even better, you could pipe to bamsort in biobambam so you can also index and mark duplicates on the fly (which probably you need to do). (I can't comment on the rest as I'm not familiar with germline variant calling).Thanks, will implement some of the steps you have outlined.