Hi,
I have a paired end DNA sequence alignment BAM file (got from using BWA -M) and would like to know if the points mentioned below is a good practice..
1) Since my aim is to perform variant calling, is it OK to keep unmapped, secondary,singleton and duplicated marked reads in the BAM file itself because by default they won't be considered for variant calling (bcftools mpileup and call)..OR..remove them to make the BAM file smaller to save space.
2) Since I am only interested in certain genes in the genome, I am planning to chop my BAM file to get only the reads aligned to my gene of interest (and 10000 bp upstream and downstream) and then perform variant calls on them. I can then perform variant calling on all my candidate genes of interest in parallel.
Can anyone let me know any caveat in the above mentioned approaches?
Excellent point made by @WouterDeCoster. I'll add one: GATK calls variants by doing local re-assembly of reads surrounding the variant sites, and sometimes it would create problems around the boundaries of variant regions used for this assembly, especially if indels are involved. I've encountered such problems and the advice I got from GATK support team is to try calling variants for the entire chromosomes as much as possible if computing resource is not a problem.
Ok..thanks for letting me know this. The coordinate screw ups do make sense..and bcftools mpileup also has the option to provide specific coordinates to perform selective variant calling..