Dear all, I am using bowtie2 to keep only the paired reads that do not map to a host genome. I map against the host sequence with bowtie2 and then use samtools to extract the unmapped reads. I have one question: in the bowtie2 output log I read:
If 13’843’083 pairs aligned 0 times concordantly or discordantly, how come some reads still aligned 1 (3’192’732) or >1 (11’569’420) times?
Thank you for any help in understanding this.
Thank you! That makes sense. But from my understanding, those 27,686,166 reads (13,843,083 read pairs) are part of the pairs that aligned 0 times concordantly or discordantly. How come some reads still align?
It means that in 13.8M fragments (27.7M reads), the two reads of the pair did not align together as a pair, i.e. not in the way expected from the library generation protocol.
The section then breaks those pairs down into single reads and reports each mate on its own. The 3 million are reads that aligned exactly once and the 11 million are multi-mappers (reads that aligned more than once).
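To make the bookkeeping concrete, here is a small sketch that redoes the arithmetic with the numbers quoted in the question. The variable names are mine, and the count of mates that did not align at all is derived by subtraction, not copied from the log.

```python
# Counts quoted in the question for the pairs that aligned
# 0 times concordantly or discordantly.
pairs_not_aligned_as_pair = 13_843_083
mates_aligned_once = 3_192_732
mates_aligned_multi = 11_569_420

# Each pair contributes two mates, which bowtie2 then counts one by one.
mates_total = 2 * pairs_not_aligned_as_pair
mates_aligned = mates_aligned_once + mates_aligned_multi

# Derived value: mates that did not align to the host at all.
mates_unaligned = mates_total - mates_aligned

print(f"mates in these pairs:      {mates_total:,}")
print(f"mates aligned (1 or >1):   {mates_aligned:,}")
print(f"mates aligned 0 times:     {mates_unaligned:,}")
```

So "the pair aligned 0 times" is a statement about the pair as a unit; roughly half of the individual mates in those pairs still hit the host genome on their own.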
Thank you for your answer. Sorry if this is a stupid question, as I am quite new to this: what does "what is expected based on library generation protocol" mean? I thought the number 13’843’083 pairs (27’686’166 mates) that are defined as being aligned 0 times concordantly or discordantly was generated by the bowtie2 analysis? If they aligned 0 times, how come some single reads still align?
You have pairs of reads... they should align next to each other if you made a perfect library.
If they don't, you can look at each pair of mates and sort them into boxes: concordant, discordant, and other. The "other" category includes mates that align to different chromosomes or pairs where one of the reads doesn't align at all.
If you're doing standard Illumina sequencing, then you expect R1 and R2 to align on opposite strands, pointing towards each other (i.e. the first sequenced base is at the outer end of the alignment), approximately 400 bp apart (this distance is determined by the library fragment size). Since one read is on the forward strand and the other on the reverse, this is known as "FR" orientation. The sequencer reads from one end of the DNA fragment, flips it over, then reads from the other end.
There are some less common library preparation protocols that result in different expected read orientations (RF or FF). Bowtie2 is counting the reads that don't align as expected.
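As a rough illustration of the "as expected" test for an FR library, here is a simplified sketch. The `Mate` class and `is_fr_concordant` function are hypothetical names made up for this example, not part of bowtie2; the 0 to 500 bp fragment range mirrors bowtie2's default `-I`/`-X` settings, and the finer rules for overlapping or containing mates are ignored.

```python
from dataclasses import dataclass


@dataclass
class Mate:
    chrom: str      # reference sequence the mate aligned to
    start: int      # leftmost aligned position (0-based)
    end: int        # one past the rightmost aligned position
    reverse: bool   # True if aligned to the reverse strand


def is_fr_concordant(m1: Mate, m2: Mate, min_frag: int = 0, max_frag: int = 500) -> bool:
    """Simplified FR check: same chromosome, opposite strands,
    pointing towards each other, fragment length within range."""
    if m1.chrom != m2.chrom:
        return False
    if m1.reverse == m2.reverse:
        return False  # FF or RR: both mates on the same strand
    fwd, rev = (m2, m1) if m1.reverse else (m1, m2)
    # Forward mate must sit upstream of the reverse mate,
    # otherwise the mates point away from each other (RF).
    if fwd.start > rev.start or fwd.end > rev.end:
        return False
    fragment_length = rev.end - fwd.start
    return min_frag <= fragment_length <= max_frag


r1 = Mate("chr1", 1000, 1100, reverse=False)
print(is_fr_concordant(r1, Mate("chr1", 1300, 1400, reverse=True)))   # expected layout
print(is_fr_concordant(r1, Mate("chr1", 5000, 5100, reverse=True)))   # too far apart
print(is_fr_concordant(r1, Mate("chr2", 1300, 1400, reverse=True)))   # different chromosome
```

Only the first pair passes; the other two are the kind of pair that ends up outside the concordant box even though both mates aligned.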
Thank you again, benformatics and d-cameron. My library comes from a sort of "environmental" sample (e.g. mammal blood) where I am trying to remove the host genome to find what else (bacteria, viruses, etc.) is in it. Does this make a difference in how the reads should align? Specifically, in the pairs that align 0 times discordantly to the host genome (e.g. the dog genome), what are the single reads (e.g. 3’192’732) still aligning to?
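For the host-removal step, it may help to see how the SAM FLAG field separates "neither mate hit the host" from "one mate still hit the host". The sketch below uses the bit values from the SAM specification; the `pair_status` helper is a hypothetical name for this example. Requiring both unmapped bits is what a filter such as `samtools view -f 12` does, assuming that is the kind of filter used here.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1          # read is part of a pair
READ_UNMAPPED = 0x4   # this read did not align
MATE_UNMAPPED = 0x8   # its mate did not align


def pair_status(flag: int) -> str:
    """Describe the mapping state of a pair from one record's FLAG."""
    if not flag & PAIRED:
        return "unpaired"
    read_unmapped = bool(flag & READ_UNMAPPED)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    if read_unmapped and mate_unmapped:
        return "both unmapped"      # candidate non-host pair
    if read_unmapped or mate_unmapped:
        return "one mate mapped"    # one end still hits the host
    return "both mapped"


# A few common FLAG values as they appear in column 2 of a SAM file.
for flag in (77, 141, 73, 133, 99):
    print(flag, pair_status(flag))
```

Whether pairs with one mate mapped to the host should be kept or discarded is a design choice; discarding them is the more conservative option when the goal is to remove host sequence.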