Perplexing Bowtie 2 results
1
0
Entering edit mode
10.1 years ago
Rob 6.9k

I have a set of Bowtie2 alignments to the transcriptome. I'm trying to make sense of the BAM file, but I'm finding some very strange results. For example there are a set of entries for a read (this is with paired-end input):

D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC  81      ENSDART00000085701      2597    0       4M1I5M1D115M    =       2610    0       GTATAAGAGACAGGCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGAA    GGGGGGGGG@GEGGE@CGGGGGGGGGGGGGGGGGGFF>GGGGEGGGGGGGGECGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGFEGFEGEGGGGGGGGGGFGFGGGGGGGGGGGGGGFFBB><3   AS:i:-54XS:i:-75 XN:i:0  XM:i:8  XO:i:2  XG:i:2  NM:i:10 MD:Z:0A3T2C1^T0G1T38T61T10T0    YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC  337     ENSDART00000145155      133     0       7M4I101M5D13M   ENSDART00000085701      2610    0       GTATAAGAGACAGGCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGAA    GGGGGGGGG@GEGGE@CGGGGGGGGGGGGGGGGGGFF>GGGGEGGGGGGGGECGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGFEGFEGEGGGGGGGGGGFGFGGGGGGGGGGGGGGFFBB><3    AS:i:-75        XS:i:-75        XN:i:0  XM:i:8  XO:i:2  XG:i:9  NM:i:17 MD:Z:0A1C0C1T2T38T60^AAGAA5C6T0 YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC  161     ENSDART00000085701      2610    3       125M    =       2597    0       GCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGATCTTTTTTTTTTTT    CCCBCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGFGGGGGGGGGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG@#######   AS:i:-36        XS:i:-61 XN:i:0  XM:i:9  XO:i:0  XG:i:0  NM:i:9  MD:Z:38T61T11G0A2G0A0G0C3A1     YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC  417     ENSDART00000088520      2685    3       99M5D26M        ENSDART00000085701      2597    0       GCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGATCTTTTTTTTTTTT    CCCBCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGFGGGGGGGGGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG@#######    AS:i:-61        XS:i:-61        XN:i:0  XM:i:10 XO:i:1  XG:i:5  NM:i:15 MD:Z:38T3T56^AAGAA1G11G0A2G0A0G0C3A1    YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC  417     ENSDART00000145155      142     3       99M5D26M        ENSDART00000085701      2597    0       GCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGATCTTTTTTTTTTTT    CCCBCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGFGGGGGGGGGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG@#######    AS:i:-63        XS:i:-61        XN:i:0  XM:i:11 XO:i:1  XG:i:5  NM:i:16 MD:Z:38T60^AAGAA5C7G0A0C1G0A0G0C2C0A1   YT:Z:UP

In each of these records, we can see that the YT field has the value 'UP' --- from the Bowtie 2 manual:

Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly.

However, when I filter unmapped reads from this file --- using samtools view -F 0x0004 --- and grep for this read, I get this exact set of records out! I'm at a loss for what is happening with these reads. Bowtie2 is saying that they failed to align either concordantly or discordantly. However, Samtools is still telling me that they're mapped (but not in a proper pair). Isn't this the definition of a discordant alignment (i.e. both ends of a paired-end read map, but not concordantly?). I'm not sure how I should interpret these records. Should I consider these reads "mapped" or not? If so, should I consider them mapped as orphans or in pairs? And if they're to be considered to be mapped in pairs, then what are the pairs --- certainly they're not the adjacent alignments and, in fact, there aren't even an even number of records so I'm not sure how they should be paired! Any help dealing with this confusion would be greatly appreciated!

RNA-Seq • 4.3k views
ADD COMMENT
4
Entering edit mode
10.1 years ago

See: http://broadinstitute.github.io/picard/explain-flags.html

81,161 denote flags for the first and second read in a pair. Both the reads have been mapped. However they have not been mapped in a proper pair. This is decided by the aligner. It may be that they are mapped very far away or very close than expected insert size. 81,161 also represent first primary alignment.

337,417 represent same as above but these alignments are not the primary alignments. This could simply because in the first or the primary alignment both reads are mapped to the same contig (= sign) but here they have been mapped to different contigs. Also, number of mismatches for the second read are higher in this alignment.

The last alignment is an alternative location for the the second read (417) in the pair above. Here second read has been aligned to a contig which is not as same as the one in the alignment above.

The total number of rows are not even because the first read in the secondary alignment is represented only once or has only one secondary alignment but the second read in the pair has two locations. This is normal.

Isn't this the definition of a discordant alignment (i.e. both ends of a paired-end read map, but not concordantly?). I'm not sure how I should interpret these records.

Yes it is.

Should I consider these reads "mapped" or not? If so, should I consider them mapped as orphans or in pairs?

These reads are mapped but not in proper pair. In the primary alignment both reads are on the same chromosome but may be they are placed very far or very close than the expected insert size. Hence the aligner has classified them as mapped but not in proper pair.

And if they're to be considered to be mapped in pairs, then what are the pairs --- certainly they're not the adjacent alignments and, in fact, there aren't even an even number of records so I'm not sure how they should be paired! Any help dealing with this confusion would be greatly appreciated!

See above.

ADD COMMENT
0
Entering edit mode

Thank you for the explanation. However, do you have any idea why Bowtie 2 is annotating these with YT:Z:UP? I would consider the read of a pair that both map, but not concordantly, as a discordant alignment --- which Bowtie 2 says it marks with YT:Z:DP. It's saying these reads aligned neither concordantly or discordantly, but it appears they did align discordantly, in exactly the way you explained. My followup question, is how would you determine the reads that represent a mapped pair here? I mean, I see how you did it by looking at the alignments and seeing what made sense, but is there any programmatic way to determine which of these mapping should be paired together?

ADD REPLY
1
Entering edit mode

May be 'YT:Z:UP' is used when aligner is not sure if the paired reads are mapped as a proper pair or not. When it is sure that they are not mapped in proper pair it uses 'DP' and when mapped in proper pair it uses 'CP'. I am confused too as why bowtie2 didn't regard read pairs mapping on different chromosomes as 'DP'. Why it is still not sure and using 'UP'. See below for the code:

if(alignedConcordant()) {
    o.append("CP");
} else if(alignedDiscordant()) {
    o.append("DP");
} else if(alignedUnpairedMate()) {
    o.append("UP");
} else if(alignedUnpaired()) {
    o.append("UU");
}
ADD REPLY
0
Entering edit mode

Go to this link: http://broadinstitute.github.io/picard/explain-flags.html

Type 81. Gives you read is paired, read is reverse strand and first in pair. Now unmark everything.

Now, to get the flag for the second read in pair, you will have to click read is paired, second in pair and mate reverse strand (you know first read is on reverse strand). You will get 161. This means both these reads are in pair.

337,417 will have all these features but it will also mark "not primary alignment" feature.

ADD REPLY
0
Entering edit mode

I see; but then, isn't every reverse strand "first-in-pair" read implicitly paired with every "second-in-pair", read whose mate is on the reverse strand? If one or both of the reads maps many places, this could lead to a large number of "implicit" alignments.

Maybe this is just the nature of the way these things are encoded, but it would seem more intuitive to just always place aligned pairs (even discordant) together, repeating the alignment for one of the mates if necessary. This existing encoding makes things particularly difficult because you have to essentially look ahead to the end of all of the records for this read, and then generate mapped pairs by implicitly pairing all "compatible" discordant alignments.

Also, I'm assuming that the UP tag placed by Bowtie 2 is a mistake, since these reads do, in fact, map discordantly (the UP should be DP).

ADD REPLY
0
Entering edit mode

There is no "second-in-pair" flag - it's "last-in-fragment". The SAM specification tries hard not to talk about pairs so the format would be future proof :)

ADD REPLY
0
Entering edit mode

I'd have to double check the code to be sure, but I believe bowtie2 considers these alignments "mixed" rather than discordant, meaning that it's aligning the reads as singletons...thus leading to the YT:Z:UP aux tag.

Processing mates like this as pairs is actually rather difficult, since not every alignment will have an obvious mate (e.g. the alignment to ENSDART00000088520:2685) and you can easily run into cases where there's more than one possible mate (e.g., read2 maps to a tandom repeat). Those end up not being that uncommon in practice. Note that the MAPQ on these is 0 to begin with, so none of these alignments are worth much...so don't stress over them :)

ADD REPLY
0
Entering edit mode

Yes, I totally forgot to say that none of these alignments will be considered by downstream tools including variant callers or RPKM generators so you should take it is easy if you have a concern that they may screw the results.

ADD REPLY
0
Entering edit mode

Actually, I'm working on my alignment-based transcript expression estimation tool (https://github.com/kingsfordgroup/sailfish/tree/feature/fastio), which was exactly what led to this question :). I was trying to expand the types of alignments to consider when estimating isoform abundance. For example, we always consider concordant alignments, but it may also be helpful to consider orphaned reads (e.g. RSEM doesn't do this but eXpress does). We can then go one step further and decide whether or not it is helpful to consider alignments of the type represented in my original post (e.g. by considering discordant alignments that happen to map to the same contig / transcript, or by considering the ends of a discordant mapping separately as orphaned reads).

ADD REPLY
0
Entering edit mode

Try STAR aligner and see how does it flag those sequences in the sam file. Does it too flag the primary alignment for those reads as discordant (81) or it flags them as concordant (83) ? You can then decide which aligner you want to use at the first level to align those reads. This is just a suggestion. You may have already put a lot of work to make your code work with Bowtie2 or Tophat's SAM file.

ADD REPLY
0
Entering edit mode

Actually, I usually use STAR and recommend it to people using my software. However, this BAM file was provided by a user who had used Bowtie 2 for alignment, and I noticed some reads like this that I'd not seen before in my STAR alignments. I'm trying to decide how to have my tool play as nicely as possible with the wide range of different aligners, so I want to make use of as little specialized info (e.g. tags) as possible.

ADD REPLY

Login before adding your answer.

Traffic: 1996 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6