Question

Paired bowtie2: 91% overall alignment but 0.01% concordant - should I be worried?

0

Entering edit mode

10.2 years ago

Bilal Akil ▴ 30

I just ran my first successful (I think) DNA alignment operation with `bowtie2` on some paired end data (namely SRR390728) and these were the results I got:

7178576 reads; of these:
  7178576 (100.00%) were paired; of these:
    7177909 (99.99%) aligned concordantly 0 times
    225 (0.00%) aligned concordantly exactly 1 time
    442 (0.01%) aligned concordantly >1 times
    ----
    7177909 pairs aligned concordantly 0 times; of these:
      1624832 (22.64%) aligned discordantly 1 time
    ----
    5553077 pairs aligned 0 times concordantly or discordantly; of these:
      11106154 mates make up the pairs; of these:
        1231919 (11.09%) aligned 0 times
        1884213 (16.97%) aligned exactly 1 time
        7990022 (71.94%) aligned >1 times
91.42% overall alignment rate

So my first impression was "WOOOOO!!! AWESOME", but on further inspection I saw:

7177909 (99.99%) aligned concordantly 0 times

And so my second impression was "hmmmm.......", and now here I am. From the Bowtie 2 manual I found that:

- "A pair that aligns with the expected relative mate orientation and with the expected range of distances between mates is said to align "concordantly". If both mates have unique alignments, but the alignments do not match paired-end expectations (i.e. the mates aren't in the expected relative orientation, or aren't within the expected distance range, or both), the pair is said to align "discordantly"."

Now I'm confused about how this information applies to my data. Should I interpret this as a problem with my dataset? Is my alignment reliable? What else should this lead me to consider?

alignment bowtie2 • 7.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Bilal Akil ▴ 30

5

Entering edit mode

This is an RNAseq dataset. Presumably you meant to align with hisat or tophat.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Devon Ryan 105k

0

Entering edit mode

Would that be the likely explanation behind this issue?

ADD REPLY • link 10.2 years ago by Bilal Akil ▴ 30

0

Entering edit mode

That'll at least explain a large chunk of the problem. Did you trim this data at all?

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Devon Ryan 105k

0

Entering edit mode

I didn't trim it - don't really know what anything is at this point, but I let it run with the --local option overnight as I understand that is more lenient about the start and end of reads (which I assume is what was in subject when you asked about trimming) and now we've got 0.14% concordant alignments :P Not so much better.

I'm going to try and run it through Tophat now.

Is "trimming" something that you recommend should always be done? I'm still trying to figure out a kind of standard procedure so it'd help to know what other people recommend :)

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Bilal Akil ▴ 30

0

Entering edit mode

10.2 years ago

Ian 6.1k

That is worrying. It might be worth checking whether the R1 and R2 read files are similarly sorted. It reads like the mapper has been able to map the reads, but not make sense of the pairing. Also, what was the insert size for the experiment. By default bowtie2 only allows up to 500bp been pairs.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Ian 6.1k

0

Entering edit mode

This is what the beginning of the R1 and R2 files look like:

==> SRR390728_1.fastq <==
@SRR390728.1 1 length=36
CATTCTTCACGTAGTTCTCGAGCCTTGGTTTTCAGC
+SRR390728.1 1 length=36
;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;665142
@SRR390728.2 2 length=36
AAGTAGGTCTCGTCTGTGTTTTCTACGAGCTTGTGT
+SRR390728.2 2 length=36
;;;;;;;;;;;;;;;;;4;;;;3;393.1+4&&5&&
@SRR390728.3 3 length=36
CCAGCCTGGCCAACAGAGTGTTACCCCGTTTTTACT
+SRR390728.3 3 length=36
-;;;8;;;;;;;,*;;';-4,44;,:&,1,4'./&1
@SRR390728.4 4 length=36
ATAAAATCAGGGGTGTTGGAGATGGGATGCCTATTT
+SRR390728.4 4 length=36
1;;;;;;,;;4;3;38;8%&,,;)*;1;;,)/%4+,
@SRR390728.5 5 length=36
TTAAGAAATTTTTGCTCAAACCATGCCCTAAAGGGT
+SRR390728.5 5 length=36
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;9445552

==> SRR390728_2.fastq <==
@SRR390728.1 1 length=36
GATGGAGAATGACTTTGACAAGCTGAGAGAAGNTNC
+SRR390728.1 1 length=36
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;96&&&&(
@SRR390728.2 2 length=36
TCCAGCTGACCCACTCCCTGGGTGGGGGGACTGGGT
+SRR390728.2 2 length=36
;;;;;;;;;;;;;;;;;;;;;<9;<;;;;;464262
@SRR390728.3 3 length=36
TATTTATTATTATTATTTTGAGACAGAGCATTGGTC
+SRR390728.3 3 length=36
9;;;;;;669;;99;;;;;-;3;2;0;+;7442&2/
@SRR390728.4 4 length=36
CTGCACACCTTGGCCTCCCAAATTGCTGGGATTACA
+SRR390728.4 4 length=36
;1;;);;;;;;;4;(;1;;;;24;;;;41-444//0
@SRR390728.5 5 length=36
TCTGTAATAAATAGGGCTGGGAAAACTGGCAAGCCA
+SRR390728.5 5 length=36
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;446662

Is this... normal looking? :P I don't think any of the reads come close to 500bp in length.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Bilal Akil ▴ 30

Ram · Accepted Answer · 2015-05-22

2

Entering edit mode

10.2 years ago

Bilal Akil ▴ 30

Devon Ryan nailed it in his comment to the main question:

"This is an RNAseq dataset. Presumably you meant to align with hisat or tophat."

I followed his advice and tried aligning it with Tophat and these were the results:

Left reads:
          Input     :   7178576
           Mapped   :   6553268 (91.3% of input)
            of these:   2950549 (45.0%) have multiple alignments (966184 have >20)
Right reads:
          Input     :   7178576
           Mapped   :   6448797 (89.8% of input)
            of these:   2912329 (45.2%) have multiple alignments (966184 have >20)
90.6% overall read mapping rate.

Aligned pairs:   6114755
     of these:   2809064 (45.9%) have multiple alignments
                  253417 ( 4.1%) are discordant alignments
81.7% concordant pair alignment rate.

Although I'm not certain whether that's a good result or not, it sure must be better than 0.01% concordant aligns.

Thanks Devon Ryan.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Bilal Akil ▴ 30

0

Entering edit mode

Odd though that almost 1 million left and 1 million right reads aligned >20 times..

ADD REPLY • link 10.2 years ago by Bilal Akil ▴ 30

0

Entering edit mode

That's much better. The multiple alignment rate does seem rather high, but perhaps that's innate to the experiment (I haven't looked up the details of that sample).

ADD REPLY • link 10.2 years ago by Devon Ryan 105k