I used Mutect for my normal/tumor pair samples(exome) with the bam files which are preprocessed by GATK for BQSR and those recalibrated bam files are used for Mutect downstream for somatic mutation identification. To note during processing of bam in GATK I have already used target bed files provided by the company which are used for target enrichment during the library preparation process. Still I wanted to see if I use those target bed files again with Mutect do I get mutations with better annotation on exomes or not. When I run mutect without the --intervals
option i.e not specifying the Target enrichment bed files I get variants which around 1800 which are having the flag KEEP but when use the option --intervals
with MuTect 1.1.4 with I do not get any variants with flag KEEP and COVERED. Is it not a right way to use the target bed file of the company while running the MuTect on GATK processed bam files? The command line I used is below
java -Xmx14g -jar /scratch/GT/softwares/mutect/muTect-1.1.4.jar \
--analysis_type MuTect \
--reference_sequence /scratch/GT/vdas/test_exome/exome/hg19.fa \
--cosmic /data/PGP/exome/mutect/hg19/hg19_cosmic_v54_120711.vcf \
--dbsnp /scratch/GT/vdas/test_exome/exome/databases/dbsnp_137.hg19.vcf \
--input_file:normal /scratch/GT/vdas/pietro/exome_seq/results/N_S8981/N_S8981.realigned.recal.bam \
--input_file:tumor /scratch/GT/vdas/pietro/exome_seq/results/T_S7999/T_S7999.realigned.recal.bam \
--out /scratch/GT/vdas/pietro/exome_seq/results/mutect/param_test/mutect_S_333soma_t.txt \
--coverage_file /scratch/GT/vdas/pietro/exome_seq/results/mutect/param_test/LG.coverage.wig.txt \
--vcf /scratch/GT/vdas/pietro/exome_seq/results/mutect/param_test/mutect_S_333soma_t.vcf \
--intervals /scratch/GT/vdas/referenceBed/hg19/ss_v4/SureSelect_XT_Human_All_Exon_V4.bed \
--fraction_contamination 0.5
I would like to know if anyone has tried this or not and the feedback. Is it advisable to use the target bed files with --interval
options of GATK recalibrated bam files while running MuTect and fishing out somatic mutations for the pair?
It seems that using the interval file is not the problem, I had re run the analysis again without using the interval list of bed file that comes with the target kit, but I kept the tumor contamination fraction as 50% or 0.5 . I get only 2 somatic variants. I have also checked the scenario with VarScan at both 25% and 50% purity and I found very few variants which were high confident. I retrieved around 180 odd variants with VarScan out of which only 8-9 were on the exons. This might be an indication that the variants might not be ideally true. But I would like to have some input from you guys who already tried exome data analysis for somatic variants with less than or equal to 50% pure line for normal/tumor pairs for fishing out mutations(somatic). Please give your views.
Hi vchris_ngs,
Your input file
realignment of the bam file is the normal exome sequence pipeline procedure, it is the realignment around the indels, I am using the realigned and recalibrated bam files and calling variants on them with Mutect, VarScan and GATK, so am keeping the source bam file same from which I am making the variant calls.
If you could share the links where the procedure is defined to get realignment around indels, that would be great.
I normally used sorted BAM for varscan2/somaticSnipper and works fine. I wanted to run muTect on the same BAM and it is throwing error.
You can see the error message which I posted here.
@Chirag Nepal
I used the below command with Mutect version 1.1.4
Using local realignment on the sorted bam files which are cleaned after marking or removing duplicates with Picard tool (please read the documentation of Picard)
Local realignment around indels
Step1:
supported extensions (.bed, .list, .picard, .interval_list, or .intervals)
This step puts the table in the file in
input.bam.list
. When this is finished we can start the realigning step using the statements below:(medium time run)Step2: Quality score recalibration
That's still not all. Quality data generated from the sequencer isn't always very accurate and for obtaining good SNP calls (which rely on base quality scores), recalibration of these scores is necessary. Again this is done in two steps: the CountCovariates step and the PrintReads steps. Both can be run from the GATK package:
1) Count covariates:
This step creates a .csv file which is needed for the next step and requires a dbSNP file, which can be downloaded at the UCSC Genome browser homepage'(http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/). DbSNP137 is the most novel one which can be downloaded from the UCSC browser, but dbSNP is updated regularly, so newer versions will be available in the future.
2) Table recalibration:
For my analysis I created this bam files individually both for the normal and the tumor samples, this bam file can be then used to run with Mutect (in my case I used the version 1.1.4), the GATK which I have used is 2.3.4, but I would recommend you to use the latest version of the suite.
hi @vchris_ngs,
Thanks for your detail answer.
I got into this error:
I used bwa to map reads, and converted/sort sam to bam.
Then used this command:
Can you help me with this?
Could it be that you forgot to set the read groups at bwa alignment?
You can use Picard AddOrReplaceReadGroups take a look here
I did forget to set read groups at bwa alignments.
ReadGroup infor are not present in fastq files. As I am working on downloaded reads from SRA, I have no information needed to put for Read group in Picardtools. I am trying to bypass this by dummy values, but not yet successful.