I used bwa mem to perform a paired alignment of two fastq files. The resulting bam file would have been used to generate an mpileup and finally a consensus sequence. I noticed that for this alignment, many paired and seemingly properly mated pairs were not given the proper pair flag on the resultant sam file. Would anyone know why something like this might happen and is there a way to fix this?
The following alignments illustrate the problem. Both reads are paired, and each individually map onto the exact same location on the reference, with the exact same cigar string. However, neither read has the properly paired flag (2)
M02670:237:000000000-CWY49:1:1101:18506:10667 97 H3N2_NA 1 60 27S235M = 1 235 GCGTGATCAGCAAAAGCAGGAGTAAAGATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCCTGATGACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATGATAGAAAGAAACATAACAGAGATAGTGTATTTGACCAACACCACCATAGAGAGGGAAATATGCC CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGFFGFGGGGFGFFFADEFFFFFFFFFF NM:i:6 MD:Z:47A41A50T29A52A4G6MC:Z:27S235M AS:i:217 XS:i:0 RG:Z:27-A_Macedonia_648_2020_S27
M02670:237:000000000-CWY49:1:1101:18506:10667 145 H3N2_NA 1 60 27S235M = 1 -235 GCGTGATCAGCAAAAGCAGGAGTAAAGATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCCTGATGACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATGATAGAAAGAAACATAACAGAGATAGTGTATTTGACCAACACCACCATAGAGAGGGAAATATGCC ?FFFFFD>FFFFFFF?BA:FFFFFFFFFFFFFFFFFFFDFFFFFFBFFFFFFFFFFFGF>DGGGGGGGGGGGGGFGGF>GFGGGGDFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC NM:i:6 MD:Z:47A41A50T29A52A4G6MC:Z:27S235M AS:i:217 XS:i:0 RG:Z:27-A_Macedonia_648_2020_S27
I am aware that in the generation of the mpileup, the -A option can be used to ignore this flag, but I am worried about resultant false positives from doing this. Furthermore, I would like a more general solution as this is alignment was output as part of a processing pipeline that processes many more samples.
samtools fixmate
?Not sure what the right answer is here though, both reads have the same sequence, both have 27 soft clipped bases, the actual fragment length is not 235 for that reason.
In general the precise rationale of what is called proper pair is not well defined. It is a "proper pair" only according to bwa and for reasons that are never explained. Of course, that knowledge is of little help if the next tool makes use of proper pair information.
I would recommend avoiding the use of this flag, it is an outdated, obsolete and plain wrong concept rooted in a time where sequencing was terrible and one needed crutches like this to move ahead.
Instead you should filter your bam file for alignments that fulfill certain properties, primary alignment, above a certain quality etc then use the all flag. Otherewise the many hidden assumptions that mpileup makes will never be clear.
also likely that this problem manifests itself because the reads are completely identical and overlapping
I don't think they are identical, because one is forward and one is reverse. Looks like the read length > fragment length.
what I meant completely overlapping (the fragment is shorter than the read lenght) hence both reads in the pair align to the exact same interval on different strands.
upon more discussion on slack we believe that the cause of not having the proper pair set is that in this case bwa can't quite decide whether the reads are in the correct orientation
--> <--
or in<-- -->
hence it won't mark them as proper pair.the simplest workaround is to use a different aligner like
bowtie2
orbbmap
perhaps
samtools fixmate
also works - would be good to know.Hello,
any new insights on this one? Cause I have exactly the same problem here. I have paired-end data sequenced on a HighSeq. After running flagstat I saw that bwa did not set the flag "properly paired" to my reads. I checked if this is true and could see that they (from my definition) should be properly paired. The direction is right either F1R2 or F2R1 and the reads are mapped to the same location. Also my reads are overlapping and a lot of them are completely identical. Only preprocessing I did was quality and adapter trimming with bbduk in paired-end mode. The pairs have the proper order in the fastq files, I checked that.
I tried samtools fixmate - did not work. I can unfortunately not work with another aligner, cause I'm using it as part of a pipeline for a specific RNA-seq experiment.
How did that happen? That seems to indicate that this sample may have been subject to too many cycles of PCR leading to these duplicates. Are your reads identical to length of sequencing? Or are these amplicons that are supposed to be like this?
Since you have BBMap installed I suggest that you try aligning with
bbmap.sh
the aligner instead.Adjust
maxindel
to a reasonable value if this is RNAseq data (size of introns). If not you can remove that and use default.Ah maybe this was a misunderstanding. I meant that the pairs are overlapping and the sequence is identical (besides the fact that one of it is the reverse complement) ;) Sorry for that. I guess the main reason that they overlap is actually the nature of the experiment. The fragments should be relatively short. I don't have the bioanalyzer sheet at the moment, but my colleague will send it to me. But I guess the fragments are actually shorter than the sequencing read length. It's an IP to analyse RNA-RNA interactions and the pulled fragments are RNAse digested to chop them a little. Paired-end in our case is needed cause you deal with chimeric fragments due to ligation of two RNA transcripts.
I see. Try aligning with bbmap and let us know if that helps. You could pre-merge the PE reads since they overlap using
bbmerge.sh
but I am not sure if that would affect your downstream analysis.