Question

do paired end reads both mapped to the same gene appear consecutive to each other in SAM file (using bowtie2)

0

Entering edit mode

5.7 years ago

Moses ▴ 150

Hi all,

I'm using bowtie2 to map short illumina reads (101 base long reads) back to genes as my reference database. However I noticed something odd in the sam file earlier. I'm using paired end reads and keeping multimaps also.

I originally thought when there is a paired end read that maps to the same gene, the two pairs of the reads should be listed consecutively, but I have noticed a few cases that are actually not. For example:

 read_1::205766 83  S1_ACR74212.1   667 255 101M    =   653 -115    GGTAAGGCACCTGTAGGACGTCCAGGACCATGTACTCCATGGGGCAAACCTGCACTTGGCTTAAAGACCAGAAAGAAGAACAAACAGTCTAACAAGATGAT   5655;5:5<6?<<85;55;9675=A9C:?;=9?9<?55:86<5@:=@<69A=856A@;A=>?A7<>:=<69:<:::??::<86?E?>;@B?A=6;:;8:5<   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP

read_2::205766  163 S1_ACR74212.1   653 255 101M    =   667 115 ACGGTGGTGGAGAAGGTAAGGCACCTGTAGGACGTCCAGGACCATGTACTCCATGGGGCAAACCTGCACTTGGCTTAAAGACCAGAAAGAAGAACAAACAG   <7:;7:9@=?>@><?@?:?C7@>9=@;=:@;<:6=E><=B7;?BC:=?@9:@?8@96:66;9A7C6699=AD6:;7<6>@869A@6<66C:6@?86=666<   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP

In the example above the read '205766' has both of its pairs mapped to the gene S1_ACR74212.1 and they appear in this sequence with this order. i.e. consecutively by listing pair 1 first and pair 2 following pair 1 immediately.

However this is not the case all the time. I came across the following example:

read_1::205932  81  S1_ACR76435.1   516 0   4M7I5M3I82M =   530 0   TTCAGACGTGTGCTCTTCCGATCTCGGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAA   7678:<>?BB?8A?>B?@ACCA9A=>>>?@=A@BB:?8A>@A9AA@BAAA@7>A:C@D@ABA?>ADACBCCC@:D@DA@?BA?C>CCCAAABB@A>@=;<>   AS:i:-61    XN:i:0  XM:i:5  XO:i:2  XG:i:10 NM:i:15 MD:Z:0A0G5T4T0A77   YT:Z:UP

read_1::205932  337 S1_ACR74576.1   516 0   4M7I5M3I82M S1_ACR76435.1   530 0   TTCAGACGTGTGCTCTTCCGATCTCGGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAA   7678:<>?BB?8A?>B?@ACCA9A=>>>?@=A@BB:?8A>@A9AA@BAAA@7>A:C@D@ABA?>ADACBCCC@:D@DA@?BA?C>CCCAAABB@A>@=;<>   AS:i:-61    XN:i:0  XM:i:5  XO:i:2  XG:i:10 NM:i:15 MD:Z:0A0G5T4T0A77   YT:Z:UP

read_2::205932  161 S1_ACR76435.1   530 0   101M    =   516 0   CCGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAAAGATCGGAAGAGCGTCGTGTAGGG   <<><>>@B@?@@@BBBB?BAA@BBBBCAB@@AB?@?A>AABCDDAA@BA?A@D@B>@C=B?A?C@;@>C?CBC<>=?>>?>==@@:;=><?@>@:?7<798   AS:i:-55    XN:i:0  XM:i:13 XO:i:0  XG:i:0  NM:i:13 MD:Z:1G76A1C0A0C0C1T2A2C1A2C2C0C0   YT:Z:UP

read_2::205932  417 S1_ACR74576.1   530 0   101M    S1_ACR76435.1   516 0   CCGACTACGGAAAGGATGTGATACAGCTTGAAAAGCGAATGGCAGAGCTTGCCGCCGCAAACGCAGCAAGAACTAAAAGATCGGAAGAGCGTCGTGTAGGG   <<><>>@B@?@@@BBBB?BAA@BBBBCAB@@AB?@?A>AABCDDAA@BA?A@D@B>@C=B?A?C@;@>C?CBC<>=?>>?>==@@:;=><?@>@:?7<798   AS:i:-55    XN:i:0  XM:i:13 XO:i:0  XG:i:0  NM:i:13 MD:Z:1G76A1C0A0C0C1T2A2C1A2C2C0C0   YT:Z:UP

As you can see in this second example I have the read with ID '205932' that is multi-mapped between the two genes S1_ACR74576.1 and S1_ACR76435.1. However the order in which the read pairs are listed is not what I expected. Specifically for both maps to the two genes pair one is appearing first in the sam file and then pair 2 for the two genes are following: i.e. the order is 'read_1::205932' then 'read_1::205932' then 'read_2::205932' then 'read_2::205932'

instead of what I expected to be as: read_1::205932' then 'read_2::205932' then 'read_1::205932' then 'read_2::205932'

in most of the cases it is respecting the order however however there are some cases (like this example that I just posted) where this order and the pairs being listed consecutively are not being respected. Is there a bowtie2 parameter that takes care of this or it's usually not necessary for them to be consecutive? Any advice would be appreciated. Thanks.

UPDATE!!

this is the order of appearance of my reads (for this particular read in this case) in my fastq files for pair 1 and pair 2 respectively:

READ 205932 pair1

@read_1::205932 1:Bacteroides_fragilis_638R_uid84217
@read_1::205932 1:Bacteroides_vulgatus_ATCC_8482_uid58253
@read_1::205932 1:Bifidobacterium_dentium_Bd1_uid43091
@read_1::205932 1:Coprococcus_catus_GD_7
@read_1::205932 1:Eubacterium_rectale_ATCC_33656_uid59169
@read_1::205932 1:Haemophilus_parainfluenzae_T3T1_uid72801

READ 205932 pair2

@read_2::205932 2:Bacteroides_fragilis_638R_uid84217
@read_2::205932 2:Bacteroides_vulgatus_ATCC_8482_uid58253
@read_2::205932 2:Bifidobacterium_dentium_Bd1_uid43091
@read_2::205932 2:Coprococcus_catus_GD_7
@read_2::205932 2:Eubacterium_rectale_ATCC_33656_uid59169
@read_2::205932 2:Haemophilus_parainfluenzae_T3T1_uid72801

SAM bowtie2 shortreads • 2.1k views

ADD COMMENT • link updated 5.7 years ago by Devon Ryan 105k • written 5.7 years ago by Moses ▴ 150

0

Entering edit mode

~~> --reorder Guarantees that output SAM records are printed in an order corresponding to the order of the reads in the original input file,~~

even when -p is set greater than 1. Specifying --reorder and setting -p greater than 1 causes Bowtie 2 to run somewhat slower and use somewhat more memory than if --reorder were not specified. Has no effect if -p is set to 1, since output order will naturally correspond to input order in that case.

See Devon's comment below. My bad.

ADD REPLY • link 5.7 years ago by ATpoint 86k

0

Entering edit mode

I got the same result unfortunately :(( I updated my original post and included a snapshot of the reasd as they appear in my fastq files, maybe that would help a bit more. The command I used with bowtie2 now is the following:

bowtie2 --reorder -a -x ../target_gene_indices/S1/S1_target_seqs_db -1 ../metagenome_sim_10/S1/S1_1.fq.gz -2 ../metagenome_sim_10/S1/S1_2.fq.gz -S ../target_gene_maps/S1.mg.sam.reordered

ADD REPLY • link 5.7 years ago by Moses ▴ 150

0

Entering edit mode

You can always name-sort the resulting alignment files if you want them to be next to each other.

ADD REPLY • link 5.7 years ago by GenoMax 148k

score 1 · Answer 1 · 2019-04-08

1

Entering edit mode

5.7 years ago

Rob 6.9k

You might consider using --no-discordant and --no-mixed to avoid the complicated patterns that can occur when one alignment record (e.g. for read 1) is paied with multiple alignments (for read 2, or vice versa). Basically, for concordant alignments, Bowtie2 will fillow your expected format. With discordant or mixed (orphan) alignments, things can become very complicated and messy to parse.

ADD COMMENT • link 5.7 years ago by Rob 6.9k

0

Entering edit mode

thanks Rob for this guide. I made a postprocessing script where I check if there are reads like that and if both of the pairs are maped to the same gene/genome then I rearrange them such that read1 to gene and read2 to the same gene appear immidiately consecutive in the sam file. Also in the script I added another filtering step where if I encounter cases where a paired end read where both of it's pairs are mapped, based on what's indicated by their flags, however if each one of the pairs are mapped to a different fragment or gene or genome (i.e. to a different reference) then I am ignoring (filtering out these reads from the postpossessed sam file).

ADD REPLY • link 5.7 years ago by Moses ▴ 150

score 0 · Answer 2 · 2019-04-08

0

Entering edit mode

5.7 years ago

Devon Ryan 105k

Bowtie2, like most aligners, will always output paired-end reads in chunks where read #1 and its multimapping alignments comes first and then read #2 and its multimapping alignments comes next. That's usually easy enough to parse. If you need the primary and secondary alignment in a different order then you'll need to write a function to do this, as neither samtools nor most aligners will do it for you.

ADD COMMENT • link 5.7 years ago by Devon Ryan 105k

0

Entering edit mode

yeah I'm working on my own script to take care of this. But what's surprising me is that in most of the cases (multimapped cases) Bowtie2 is respecting the order of pairing in the sam file. i.e. Let's say I have a paired end read R1 and R2, that is mapped to 3 different genes g1,g2 and g3. In most of the cases in my sam file the pairing is listed as:

R1 g1 R2 g1 R1 g2 R2 g2 R1 g3 R2 g3

and in some rare cases which I don't understand why it's listing something like this

R1 g1 R1 g2 R2 g1 R2 g2

should the --reorder parameter fix this or is there any other parameter or function that's already defined that will take care of this?

ADD REPLY • link 5.7 years ago by Moses ▴ 150

0

Entering edit mode

--reorder is doing something different, it's ensuring that the order of the pairs matches what's in the fastq file, not that pairs themselves have any particular order.

ADD REPLY • link 5.7 years ago by Devon Ryan 105k