Hi, I am trying to do indel realignment using the GATK tool IndelRealigner. I did my alignment using bwa mem with the -M command for alternative alignments. When I go to use GATK IndelRealigner it removes all of my reads that have the potential to have secondary alignments even if the secondary alignment score is low. Everything that I have read says to align with secondary alignments kept before going into GATK, but if I do this then 95-98% of my reads are removed when performing the IndelRealigner step. I have tried the --filter_bases_not_stored, -rf NotPrimaryAlignment and still get 95-98% of my reads removed. Any input would be appreciated. Thanks
So, if I put -rf NotPrimaryAlignment they fail because of this filter even though I have primary alignments for the read. If I put the -filter_bases_not_stored, they fail because of MalformedRead. When I check the headers using picard's ValidateSamFile, I do not get any errors returned. This is why I am stumped.
And yes, samtools flagstat gives a high amount of secondary alignments
I decided to attempt the HaplotypeCaller instead to see if it would give the same error. Rather, it gave me failing for the MappingQuality. After looking at my bam file, most of my alignments have low MAPQ- mostly 8 or even 0 for primary alignments. I blasted a few sequences in ncbi and ma having mixed feelings about the MAPQ values- some with values of 0 mapped perfectly to the soybean genome without any filtering for organisms (i.e. blasting aganist the entire NCBI database), while others with high MAPQ (32) aligned in two portions and not in one piece. I've done alignments before, but always with TopHat and never encountered any problems like this. From my understanding, and please correct me it I am wrong, there is no need to change the MAPQ values from bwa alignments (unlike with other alignment programs), so why do I have such low MAPQ values?
Are you mapping to a highly repetitive genome? BWA assigns MAPQ=0 to reads mapping to multiple locations, as they are anyway (almost) useless for variant calling. Did you see more than one equally plausible hits in Blast for reads with MPAQ=0? On the other hand, I find your results very unusual. Something is fundamentally wrong. Why don't you try mapping with TopHat just for comparison. What params are you using for BWA? What is your whole commandline? Lastly, bear in mind that GATK pipeline is highly optimized for human. Non-humans, especially multiploid organisms can show some unsuspecting surprises. You should also consult GATK forums, there is plethora of materials and usually they are swift in reply.
Is it possible that your genome is not the right genome, due to which BWA is forced to split the reads?
I am working with soybean, so it does have repetitive regions and is polyploid. My bwa command is as follows: bwa mem -r 1 -a -M -R (RG link) genome input.bam
My secondary hits are marked as such (flag 256), but it appears that GATK is not accepting any reads with duplicates. I am also finding that even reads with a MAPQ of 60 are getting secondary alignments.
As for the GATK pipeline, I was using sambamba markdup -r to remove duplicates, samtools index to index my bam files, and the above mentioned commands for the downstream processes of GATK.
I am simply beginning to suspect that GATK will not work for soybean. Too bad, I was appreciative of how it realigns indels to weed out any uncertainty. I'll find/create another similar script. Thank-you for your time