Hi all,
I'm using bowtie2 to map short illumina reads (101 base long reads) back to genes as my reference database. However I noticed something odd in the sam file earlier. I'm using paired end reads and keeping multimaps also.
I originally thought when there is a paired end read that maps to the same gene, the two pairs of the reads should be listed consecutively, but I have noticed a few cases that are actually not. For example:
read_1::205766 83 S1_ACR74212.1 667 255 101M = 653 -115 GGTAAGGCACCTGTAGGACGTCCAGGACCATGTACTCCATGGGGCAAACCTGCACTTGGCTTAAAGACCAGAAAGAAGAACAAACAGTCTAACAAGATGAT 5655;5:5<6?<<85;55;9675=A9C:?;=9?9<?55:86<5@:=@<69A=856A@;A=>?A7<>:=<69:<:::??::<86?E?>;@B?A=6;:;8:5< AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP
read_2::205766 163 S1_ACR74212.1 653 255 101M = 667 115 ACGGTGGTGGAGAAGGTAAGGCACCTGTAGGACGTCCAGGACCATGTACTCCATGGGGCAAACCTGCACTTGGCTTAAAGACCAGAAAGAAGAACAAACAG <7:;7:9@=?>@><?@?:?C7@>9=@;=:@;<:6=E><=B7;?BC:=?@9:@?8@96:66;9A7C6699=AD6:;7<6>@869A@6<66C:6@?86=666< AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP
In the example above the read '205766' has both of its pairs mapped to the gene S1_ACR74212.1 and they appear in this sequence with this order. i.e. consecutively by listing pair 1 first and pair 2 following pair 1 immediately.
However this is not the case all the time. I came across the following example:
read_1::205932 81 S1_ACR76435.1 516 0 4M7I5M3I82M = 530 0 TTCAGACGTGTGCTCTTCCGATCTCGGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAA 7678:<>?BB?8A?>B?@ACCA9A=>>>?@=A@BB:?8A>@A9AA@BAAA@7>A:C@D@ABA?>ADACBCCC@:D@DA@?BA?C>CCCAAABB@A>@=;<> AS:i:-61 XN:i:0 XM:i:5 XO:i:2 XG:i:10 NM:i:15 MD:Z:0A0G5T4T0A77 YT:Z:UP
read_1::205932 337 S1_ACR74576.1 516 0 4M7I5M3I82M S1_ACR76435.1 530 0 TTCAGACGTGTGCTCTTCCGATCTCGGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAA 7678:<>?BB?8A?>B?@ACCA9A=>>>?@=A@BB:?8A>@A9AA@BAAA@7>A:C@D@ABA?>ADACBCCC@:D@DA@?BA?C>CCCAAABB@A>@=;<> AS:i:-61 XN:i:0 XM:i:5 XO:i:2 XG:i:10 NM:i:15 MD:Z:0A0G5T4T0A77 YT:Z:UP
read_2::205932 161 S1_ACR76435.1 530 0 101M = 516 0 CCGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAAAGATCGGAAGAGCGTCGTGTAGGG <<><>>@B@?@@@BBBB?BAA@BBBBCAB@@AB?@?A>AABCDDAA@BA?A@D@B>@C=B?A?C@;@>C?CBC<>=?>>?>==@@:;=><?@>@:?7<798 AS:i:-55 XN:i:0 XM:i:13 XO:i:0 XG:i:0 NM:i:13 MD:Z:1G76A1C0A0C0C1T2A2C1A2C2C0C0 YT:Z:UP
read_2::205932 417 S1_ACR74576.1 530 0 101M S1_ACR76435.1 516 0 CCGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAAAGATCGGAAGAGCGTCGTGTAGGG <<><>>@B@?@@@BBBB?BAA@BBBBCAB@@AB?@?A>AABCDDAA@BA?A@D@B>@C=B?A?C@;@>C?CBC<>=?>>?>==@@:;=><?@>@:?7<798 AS:i:-55 XN:i:0 XM:i:13 XO:i:0 XG:i:0 NM:i:13 MD:Z:1G76A1C0A0C0C1T2A2C1A2C2C0C0 YT:Z:UP
As you can see in this second example I have the read with ID '205932' that is multi-mapped between the two genes S1_ACR74576.1 and S1_ACR76435.1. However the order in which the read pairs are listed is not what I expected. Specifically for both maps to the two genes pair one is appearing first in the sam file and then pair 2 for the two genes are following: i.e. the order is 'read_1::205932' then 'read_1::205932' then 'read_2::205932' then 'read_2::205932'
instead of what I expected to be as: read_1::205932' then 'read_2::205932' then 'read_1::205932' then 'read_2::205932'
in most of the cases it is respecting the order however however there are some cases (like this example that I just posted) where this order and the pairs being listed consecutively are not being respected. Is there a bowtie2 parameter that takes care of this or it's usually not necessary for them to be consecutive? Any advice would be appreciated. Thanks.
UPDATE!!
this is the order of appearance of my reads (for this particular read in this case) in my fastq files for pair 1 and pair 2 respectively:
READ 205932 pair1
@read_1::205932 1:Bacteroides_fragilis_638R_uid84217
@read_1::205932 1:Bacteroides_vulgatus_ATCC_8482_uid58253
@read_1::205932 1:Bifidobacterium_dentium_Bd1_uid43091
@read_1::205932 1:Coprococcus_catus_GD_7
@read_1::205932 1:Eubacterium_rectale_ATCC_33656_uid59169
@read_1::205932 1:Haemophilus_parainfluenzae_T3T1_uid72801
READ 205932 pair2
@read_2::205932 2:Bacteroides_fragilis_638R_uid84217
@read_2::205932 2:Bacteroides_vulgatus_ATCC_8482_uid58253
@read_2::205932 2:Bifidobacterium_dentium_Bd1_uid43091
@read_2::205932 2:Coprococcus_catus_GD_7
@read_2::205932 2:Eubacterium_rectale_ATCC_33656_uid59169
@read_2::205932 2:Haemophilus_parainfluenzae_T3T1_uid72801
> --reorder Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file,See Devon's comment below. My bad.I got the same result unfortunately :(( I updated my original post and included a snapshot of the reasd as they appear in my fastq files, maybe that would help a bit more. The command I used with bowtie2 now is the following:
You can always name-sort the resulting alignment files if you want them to be next to each other.