I have tried to document general workflow and potential errors you might come at each step. All these steps are pre-requisite if you want to run GATK/muTect, while not necessary for other SNV tools (Varscan2, SomaticSnipper, Shimmer). After you perform these steps, one can input BAM files into various SNV tools such as Varscan2, muTect, SomaticSnipper, Shimmer. There are many other tools, but these are the tools I used.
Step 1
Map Pair-end FASTQ files using BWA. While mapping the reads, it might be possible the Read Group (RG) information may not be provided. If you want to use GATK, this is compulsory. RG can be inserted in bwa command with -r '@RG\tID:foo\tSM:bar'
, or you could insert during pre-processing of BAM files using PicardTools. "I had already mapped BAM files, hence I inserted RG info using PicardTools. I used dummy values in RG files, so as to bypass the error."
Step 2
If you have multiple files, loop all the file names.
for name in Sample_ST25_sorted.bam Sample_ST26_sorted.bam
do
output1=${name%_sorted.bam}_Filtered.bam
output2=${name%_sorted.bam}_Filtered_SortedFixed.bam
output3=${name%_sorted.bam}_Filtered_SortedFixed_Reorder.bam
output4=${name%_sorted.bam}_Filtered_SortedFixed_Reorder_Clean.bam
output5=${name%_sorted.bam}_Filtered_SortedFixed_Reorder_Clean_MarkDup.bam
output6=${name%_sorted.bam}_Filtered_SortedFixed_Reorder_Clean_MarkDup_AddRG.bam
output7=${name%_sorted.bam}_IndelRealigned.bam
intervalList=${name%_sorted.bam}_Interval.list
Step 2.1
Picard does not accept the assigned flag (from BWA) for unmapped reads SAM/BAM from BWA. I fixed simply by removing the unmapped reads, as I am focused only on unique mapping reads. However, you could also change the flag and proceed.
For detail discussions see this post on Biostars-: Final Solution For "Mapq Should Be 0 For Unmapped Read."
samtools view -bF 4 -q 1 $name > $RealignedBamDir/$output1
Step 2.2:
Fix the reads and sort the bam files based on their coordinates. Even though I had sorted file from samtools, it gave me errors while running on Picard. So it is recommended to use Picard to sort BAM files. It is good to provide the option TMP_DIR
, instead of using the system temp_dir, since you might an error for large files. Rule of thumb: Always use VALIDATION_STRINGENCY=STRICT
.
java -Xmx20g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=`pwd`/tmpA25 -jar $picard/FixMateInformation.jar \
I=$RealignedBamDir/$output1 \
O=$RealignedBamDir/$output2 \
MAX_RECORDS_IN_RAM=2000000 \
TMP_DIR=`pwd`/tmpA25 \
SO=coordinate VALIDATION_STRINGENCY=STRICT
Step 2.3
Reorder bam files in the order of fasta
java -Xmx8g -XX:ParallelGCThreads=8 -Djava.io.tmpdir=`pwd`/tmpA25A -jar $picard/ReorderSam.jar \
I=$RealignedBamDir/$output2 \
O=$RealignedBamDir/$output3 \
R=$Reference \
TMP_DIR=`pwd`/tmpA25A \
VALIDATION_STRINGENCY=STRICT
Step-2.4
You might encounter this error "CIGAR M operator maps off end of reference". If such error appears, Clean BAM. These errors are generally found when certain reads map to the end of chromosomes and beyond, or if the reference assembly used is different. For eg, see this post Should all the HG19 reference be the same? to see how the UCSC assembly and Ensembl assembly differ in chrM.
java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/CleanSam.jar \
I=$RealignedBamDir/$output3 \
O=$RealignedBamDir/$output4
Step 2.5
Mark duplicates
java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/MarkDuplicates.jar \
I=$RealignedBamDir/$output4 \
O=$RealignedBamDir/$output5 \
M=$RealignedBamDir/File$name \
VALIDATION_STRINGENCY=STRICT
Step 2.6
Add ReadGroup (RG). If RG info are provided during mapping, this step can be excluded. Here, I have stored dummy variables.
java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/AddOrReplaceReadGroups.jar \
I=$RealignedBamDir/$output5 \
O=$RealignedBamDir/$output6 \
RGID=group1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1
Step 2.7
Create BAM index
java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/$output6
Step 2.8
Create List of all INDELs
java -Xmx8g -XX:ParallelGCThreads=8 -jar $GATK/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R $Reference \
-I $RealignedBamDir/$output6 \
-known $Indel1000G \
-o $RealignedBamDir/$intervalList
Step 2.9
Make new realigned BAM files
java -Xmx8g -XX:ParallelGCThreads=8 -jar $GATK/GenomeAnalysisTK.jar \
-T IndelRealigner \
-R $Reference \
-I $RealignedBamDir/$output6 \
-targetIntervals $RealignedBamDir/$intervalList \
-known $Indel1000G \
-o $RealignedBamDir/$output7
Step 2.10
Create BAM index of realigned BAM files
java -Xmx8g -XX:ParallelGCThreads=8 -jar $picard/BuildBamIndex.jar I=$RealignedBamDir/$output7
done
Step 3
Now the BAM files can be feed to any SNV tools, listed above or other existing tools of your choice. If you further want to re-process your data, you could add recalibration step from GATK and have the final recalibrated BAM. If you do the recalibration step, it is always nice to compare the SNV list, before and after recalibration.</h3>
Step 4
Run SNV tools on each BAM files.
The overlap (somatic mutations) between various SNV tools are notoriously low (as it seems from my experience and few published articles), I would recommend using at least two tools.
Step 4.1
Run muTect. You might encounter an error with newer version of Java. For some unknown reasons, I could run muTect only with older version, i.e, Java6.
/usr/lib/jvm/java-6-openjdk-amd64/bin/java -jar $muTect \
--analysis_type MuTect \
--reference_sequence $sequence \
--cosmic $cosmicData \
--dbsnp $dbsnp138 \
--input_file:normal $P1N \
--input_file:tumor $P1T \
--out P1Out \
--coverage_file P1.wig
Now, you can filter the output by using awk command:
cat P1Out | grep -v REJECT > P1Out.filtered
Step 4.2 Run Varscan2
Filter high confidence prediction
Step 4.3 Run Somatic Sniper
Filter high-confidence prediction
Step-5
Merge the somatic calls for all tools and see how much overlap between these tools. Do the downstream analysis.
it's rather de-moralizing that it requires this many steps to get a usable bam. that's one reason to prefer samtools or freebayes (or platypus) variant callers.
I don't think using alternative variant callers really has much impact on pre-processing steps. And some of them can be easily combined or skipped except cases where they fail. But marking duplicate reads, ensuring rads are sorted in the right order, etc are all necessary (or at least highly recommended) regardless of what variant caller you ultimately want to use. I've seen some suggestions where FreeBayes and Platypus weren't as sensitive to needing to do local realignment as UnfiedGenotyper or HaplotypeCaller but it was still recommended.
Don't forget to add
-M
argument to bwa-mem, if you are using picard as downstream tool.