Question

Alignments not labelled as proper pair on bwa mem

0

Entering edit mode

5.5 years ago

lastenigma1805 • 0

I used bwa mem to perform a paired alignment of two fastq files. The resulting bam file would have been used to generate an mpileup and finally a consensus sequence. I noticed that for this alignment, many paired and seemingly properly mated pairs were not given the proper pair flag on the resultant sam file. Would anyone know why something like this might happen and is there a way to fix this?

The following alignments illustrate the problem. Both reads are paired, and each individually map onto the exact same location on the reference, with the exact same cigar string. However, neither read has the properly paired flag (2)

M02670:237:000000000-CWY49:1:1101:18506:10667   97  H3N2_NA 1   60  27S235M =   1   235 GCGTGATCAGCAAAAGCAGGAGTAAAGATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCCTGATGACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATGATAGAAAGAAACATAACAGAGATAGTGTATTTGACCAACACCACCATAGAGAGGGAAATATGCC  CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGFFGFGGGGFGFFFADEFFFFFFFFFF  NM:i:6  MD:Z:47A41A50T29A52A4G6MC:Z:27S235M AS:i:217    XS:i:0  RG:Z:27-A_Macedonia_648_2020_S27
M02670:237:000000000-CWY49:1:1101:18506:10667   145 H3N2_NA 1   60  27S235M =   1   -235    GCGTGATCAGCAAAAGCAGGAGTAAAGATGAATCCAAATCAAAAGATAATAACGATTGGCTCTGTTTCTCTCACCATTTCCACAATATGCTTCTTCATGCAAATTGCCATCCTGATGACTACTGTAACATTGCATTTCAAGCAATATGAATTCAACTCCCCCCCAAACAACCAAGTGATGCTGTGTGAACCAACAATGATAGAAAGAAACATAACAGAGATAGTGTATTTGACCAACACCACCATAGAGAGGGAAATATGCC  ?FFFFFD>FFFFFFF?BA:FFFFFFFFFFFFFFFFFFFDFFFFFFBFFFFFFFFFFFGF>DGGGGGGGGGGGGGFGGF>GFGGGGDFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC  NM:i:6  MD:Z:47A41A50T29A52A4G6MC:Z:27S235M AS:i:217    XS:i:0  RG:Z:27-A_Macedonia_648_2020_S27

I am aware that in the generation of the mpileup, the -A option can be used to ignore this flag, but I am worried about resultant false positives from doing this. Furthermore, I would like a more general solution as this is alignment was output as part of a processing pipeline that processes many more samples.

software error • 5.8k views

ADD COMMENT • link updated 2.3 years ago by Feng Tian ▴ 20 • written 5.5 years ago by lastenigma1805 • 0

0

Entering edit mode

samtools fixmate?

ADD REPLY • link 5.5 years ago by GenoMax 154k

0

Entering edit mode

Not sure what the right answer is here though, both reads have the same sequence, both have 27 soft clipped bases, the actual fragment length is not 235 for that reason.

In general the precise rationale of what is called proper pair is not well defined. It is a "proper pair" only according to bwa and for reasons that are never explained. Of course, that knowledge is of little help if the next tool makes use of proper pair information.

I would recommend avoiding the use of this flag, it is an outdated, obsolete and plain wrong concept rooted in a time where sequencing was terrible and one needed crutches like this to move ahead.

Instead you should filter your bam file for alignments that fulfill certain properties, primary alignment, above a certain quality etc then use the all flag. Otherewise the many hidden assumptions that mpileup makes will never be clear.

ADD REPLY • link 5.5 years ago by Istvan Albert 103k

0

Entering edit mode

also likely that this problem manifests itself because the reads are completely identical and overlapping

ADD REPLY • link 5.5 years ago by Istvan Albert 103k

0

Entering edit mode

I don't think they are identical, because one is forward and one is reverse. Looks like the read length > fragment length.

ADD REPLY • link 5.5 years ago by swbarnes2 15k

0

Entering edit mode

what I meant completely overlapping (the fragment is shorter than the read lenght) hence both reads in the pair align to the exact same interval on different strands.

upon more discussion on slack we believe that the cause of not having the proper pair set is that in this case bwa can't quite decide whether the reads are in the correct orientation --> <-- or in <-- --> hence it won't mark them as proper pair.

the simplest workaround is to use a different aligner like bowtie2 or bbmap

perhaps samtools fixmate also works - would be good to know.

ADD REPLY • link 5.5 years ago by Istvan Albert 103k

0

Entering edit mode

Hello,

any new insights on this one? Cause I have exactly the same problem here. I have paired-end data sequenced on a HighSeq. After running flagstat I saw that bwa did not set the flag "properly paired" to my reads. I checked if this is true and could see that they (from my definition) should be properly paired. The direction is right either F1R2 or F2R1 and the reads are mapped to the same location. Also my reads are overlapping and a lot of them are completely identical. Only preprocessing I did was quality and adapter trimming with bbduk in paired-end mode. The pairs have the proper order in the fastq files, I checked that.

I tried samtools fixmate - did not work. I can unfortunately not work with another aligner, cause I'm using it as part of a pipeline for a specific RNA-seq experiment.

ADD REPLY • link 4.0 years ago by v1n • 0

0

Entering edit mode

Also my reads are overlapping and a lot of them are completely identical.

How did that happen? That seems to indicate that this sample may have been subject to too many cycles of PCR leading to these duplicates. Are your reads identical to length of sequencing? Or are these amplicons that are supposed to be like this?

Since you have BBMap installed I suggest that you try aligning with bbmap.sh the aligner instead.

# Create an index with your reference genome (Multi-fasta file input, 30G needed for human sized genome adjust down or up as needed, to get BAM file have `samtools` or `sambamba` available in $PATH)
bbmap.sh -Xmx30g ref=genome.fa
# Align your data
bbmap.sh -Xmx30g threads=N (use an appropriate number) in1=R1.fq.gz in2=R2.fq.gz out=file.bam trd=t ambig=random maxindel=1000

Adjust maxindel to a reasonable value if this is RNAseq data (size of introns). If not you can remove that and use default.

ADD REPLY • link 4.0 years ago by GenoMax 154k

0

Entering edit mode

How did that happen? That seems to indicate that this sample may have been subject to too many cycles of PCR leading to these duplicates. Are your reads identical to length of sequencing? Or are these amplicons that are supposed to be like this?

Ah maybe this was a misunderstanding. I meant that the pairs are overlapping and the sequence is identical (besides the fact that one of it is the reverse complement) ;) Sorry for that. I guess the main reason that they overlap is actually the nature of the experiment. The fragments should be relatively short. I don't have the bioanalyzer sheet at the moment, but my colleague will send it to me. But I guess the fragments are actually shorter than the sequencing read length. It's an IP to analyse RNA-RNA interactions and the pulled fragments are RNAse digested to chop them a little. Paired-end in our case is needed cause you deal with chimeric fragments due to ligation of two RNA transcripts.

ADD REPLY • link 4.0 years ago by v1n • 0

0

Entering edit mode

I see. Try aligning with bbmap and let us know if that helps. You could pre-merge the PE reads since they overlap using bbmerge.sh but I am not sure if that would affect your downstream analysis.

ADD REPLY • link 4.0 years ago by GenoMax 154k

score 0 · Answer 1 · 2023-08-06

Please check the insertion size cutoff (min, max) shown in the bwa mem running log.
Then you could find whether 235 (the insertion size in your case) is in the cutoff.
If no, the reads will not be labeled as proper pairs.
If yes, but 235 is exactly the minimum cutoff, then the reads were not labeled as proper pairs because of a 1-bp issue:
The dist variable (the distance between the R1 and R2 starting position) instead of the insertion size was used to compare with the cutoff.
https://github.com/lh3/bwa/blob/master/bwamem_pair.c#L239-L242

Solution:
From:

if (dist > pes[dir].high) break;
if (dist < pes[dir].low)  continue;

To:

if (dist + 1 > pes[dir].high) break;
if (dist + 1 < pes[dir].low)  continue;

Please also see the detail here:
https://github.com/lh3/bwa/issues/397