Hello!
My question is related to the coverage of reads at two steps first, after mapping and second post-alignment processing. The main aim of my project is to identify the somatic mutation. For this, I would think that the coverage of the post-alignment process should be good. My variant calling pipeline is very standard with default parameters and LENIENT stringency criteria that consist of the following steps:
- trimming (trimmomatic -0.35)
- Alignment (bowtie2)
- AddOrReplaceReadGroups, MarkDuplicates (Picard)
- RealignerTargetCreator, IndelRealigner (GATK)
- FixMateInformation (Picard)
- HaplotypeCaller (GATK)
After step 2, I found that there are approx 1000 reads for the exome of interest. however, after step 3,4 and 5, we found less than 10 reads. I am not sure for the reasons, but we tracked the decrease in the read number after mark duplicate step.
Is there any paper/discussion to suggest the what should be the minimum and maximum required/acceptable reads to identify somatic mutations?
Is my pipeline creating this problem? Will using samtools + VarScan2 be a better solution without step 3 and 4? As I think the number of reads to identify somatic mutation is very low and because of that in VCF annotation, we are unable to identify the known mutations.
Any help suggestion is appreciated.
Thanks!
Yes, we are using Agilent HaloPlex - Amplicons. I did not look for the pipeline according to the sequencing data. After going through initial search, my guess is that I can skip step 3-5 and after alignment step I can perform variant calling. Is this correct way to proceed? Is there a paper or discussion that I can find more information and reasoning?
Is there a better method than the one mentioned above i should implement in my pipeline?
Use SureCall from Agilent, it's probably easiest. I've no idea whether Haplotype Caller is good for you, certainly a couple of years ago there was no real support/documentation for amplicon data with GATK. If you're looking for somatic mutations, you should probably be using MuTect anyway if you're wedded to a Broad approach.
You do not deduplicate amplicon data - even though Haloplex is a hybrid, it still has significant PCR steps in. If you visualise your BAM file after alignment then you will see that the reads stack up like a city skyline - these are duplicates - they all have the same start and stop position because they're generated by PCR. If you remove duplicates - well you have seen what happens. Removing duplicates is only suitable for situations where you've randomly sheared the DNA and are using (e.g.) hybridisation enrichment.