Question

Reverse complements in BWA

1

Entering edit mode

6.8 years ago

dp ▴ 50

I am aligning paired end reads to contigs from an assembly using BWA and it looks like I get different results if I use the contigs or their reverse complements. Does this make sense? Which result should I use since they are different?

alignment sequence reads • 3.3k views

ADD COMMENT • link 6.8 years ago by dp ▴ 50

1

Entering edit mode

Are you getting these differences only with multimappers? That'd be expected.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks @Devon. Why would that be the case, doesn't the aligner align both the read and it's reverse complement to the contig? If the forward sense read maps in multiple places wouldn't the reverse complement map to the same places?

Also - what is the best way to check this, I've been dumping the alignments to bam and I noticed the difference in a downstream application. When I check the bam file sizes I see a differnce of a few kilobytes, but I'm not sure how to find the specific alignments that are different.

ADD REPLY • link 6.8 years ago by dp ▴ 50

0

Entering edit mode

Of course, but the randomly returned alignment for multimappers wouldn't be the same, so the result you're seeing would be different. There's no point in actually checking this unless what you're seeing is differences in reliable (MAPQ >= 1 at least) alignments, since otherwise you're just seeing the effect of various random seeds.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

You should qualify and quantify what you mean by " different results". Later you talk about differences in BAM size, that's normal and expected. Information presented differently will compress at different rates. Do you see more meaningful differences?

ADD REPLY • link 6.8 years ago by Istvan Albert 102k

0

Entering edit mode

Yes, downstream I am using the alignments of the reads to bridge between contigs in the assembly graph and have different results. I was surprised to see that these results are different depending on whether I use the sense or anti-sense sequence but so far the only difference I've been able to track down is that the bam file (and bam index and sorted bam file) sizes are different. I'm not sure how to track down individual alignments that are different, but am surprised that bwa isn't agnostic to sense.

ADD REPLY • link 6.8 years ago by dp ▴ 50

0

Entering edit mode

You say "assembly graphs" - bwa does not produce that - that sounds like a different tool producing a different output. Perhaps the tool producing your assembly graph caused the problem.

As I said before you need to first quantify the "difference" or at least have some evidence that the alignments are different. Sizes are not proof. I very much doubt that your alignments will be different in a systematic way when reverse complementing the reference sequences.

ADD REPLY • link 6.8 years ago by Istvan Albert 102k

0

Entering edit mode

@Istvan - no, the other parts of the pipeline are deterministic:

First I assemble the reads, then, because the fastg format for assembly graphs contains both the forward and reverse complement sequences for each edge, I remove one, then I align the reads to the contigs to create a graph of paired end connections. I get different connectivity graphs when I choose the forward sequences in the fastg from when I choose the first sequence that appears, either forward or reverse complement.

The only thing that changes is the sense of some of the sequences being aligned to. I have observed different bam file sizes, but perhaps this is to be expected even if the alignments are exactly the same. My questions are:

1) How can I track down if there are some different alignments in the bam files?

2) Does anyone know whether it makes sense that BWA would do this, and understand why that would be the case? I was under the impression that the aligner should be agnostic to the sense of the reference sequences.

ADD REPLY • link 6.8 years ago by dp ▴ 50

0

Entering edit mode

I think to clarify this you need to let go of all the other aspects of the problem. You do a lot of things and claim that every other step is deterministic yet it is not clear that they are. I think you are jumping ahead to saying bwa does something incorrect where it is far more likely that you not repeating exactly the same process - the data is not actually the same.

Take your sequences align them with BWA. Take an index stat, count how many reads align on one strand versus the other: flagstat and indxstats can do that. Now reverse complement the references and do the same. But note, don't just select another dataset that you "think" should be the reverse complement. Reverse complement the actual data you have used in the previous step. Now rerun the alignments then flagstat and indxstats this will summarize some of the differences. Now turn both files into BED files and visualize them. I think you'll find that both are the same.

ADD REPLY • link 6.8 years ago by Istvan Albert 102k

0

Entering edit mode

@istvan - I did this with no change, as expected.

There aren't two datasets, there is one dataset which is in a fastg file so it contains the forward and reverse complement sequence of every node.

I create a fasta file by dumping half of these. In one case I keep only the forward sense nodes, in the other case I use the first direction, forward or reverse complement (think of this as randomly selected which to keep). The stats that you suggested to look at are slightly different for each. When I take the reverse complement myself instead of reading it from the fastg file, the results are of course the same as when I read the reverse complements from the file.

Any more suggestions as to why bwa works differently on the all forward vs the some forward some reverse nodes?

ADD REPLY • link 6.8 years ago by dp ▴ 50

0

Entering edit mode

I did this with no change, as expected.

So if you did this it means that bwa does actually work as expected.

We keep talking about "differences" but in this entire thread, you have never actually shown a difference. File size is NOT a difference. Why don't you list some actual differences?

ADD REPLY • link 6.8 years ago by Istvan Albert 102k

0

Entering edit mode

Please also only differences with MAPQ > 0.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

I don't know how to find the differences - that is what I asked. Using flagstat as you suggested I get:

3214312 + 0 in total (QC-passed reads + QC-failed reads) \\ 3214312 + 0 in total (QC-passed reads + QC-failed reads)

3183485 + 0 mapped (99.04%:-nan%) \\ 3183473 + 0 mapped (99.04%:-nan%)

3214312 + 0 paired in sequencing \\ 3214312 + 0 paired in sequencing

2979678 + 0 properly paired (92.70%:-nan%) \\ 2978322 + 0 properly paired (92.66%:-nan%)

3169922 + 0 with itself and mate mapped \\ 3169906 + 0 with itself and mate mapped

13563 + 0 singletons (0.42%:-nan%) \\ 13567 + 0 singletons (0.42%:-nan%)

113930 + 0 with mate mapped to a different chr \\ 113804 + 0 with mate mapped to a different chr

107779 + 0 with mate mapped to a different chr (mapQ>=5) \\ 107765 + 0 with mate mapped to a different chr (mapQ>=5)

idxstats shows there are many edges with small differences in the number of mapped and unmapped. For example:

EDGE_1811_length_385_cov_18.5182        385     221     5 \\ EDGE_1811_length_385_cov_18.5182        385     216     5

Why these differences exist is a mystery to me since the sequences are the same, except in some cases one reference contains the reverse complement of the other. Is that how BWA is expected to work?

ADD REPLY • link updated 6.8 years ago by Devon Ryan 104k • written 6.8 years ago by dp ▴ 50

0

Entering edit mode

short read aligners are not "optimal" aligners, they are not guaranteed to find neither the best nor all alignments, though in general make very few mistakes. These mistakes are usually for alignments that are not perfect to begin with and could be false positives or false negatives.

For example, in this case, the number of mapped reads differ by 12 alignments out of over 3 million so that makes it about .0004 percent!

I would not lose much sleep over that. I would not quantify that as different beyond expected.

ADD REPLY • link 6.8 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks. Ineresting that even the small number of differences makes a difference when looking for paths in the graph formed by contigs and the paired-end reads connecting them. I guess the 1300 or so differences in the paired reads makes the difference. Wish I knew why there's a difference depending on the sense of the reference sequences, I thought it was reasonable to assume the aligner would be completely agnostic to this.

ADD REPLY • link 6.8 years ago by dp ▴ 50