Hi,
I have paired end reads Illumina 1.9 encoding. I am trying to figure out a way to see the reads that are multimapped, from the sam file.
To test this I created a database (an index file) using bowtie2 from a fasta file that contains a whole genome for Bacteroides vulgatus ATCC 8482 species but I copied the same genome twice in the same fasta file before creating the index file. I did this intentionally to check for duplicate reads. I deally any read that will be mapped to this genome should be mapped twice since I have the exact same sequence in my database twice. To create the index file I issued the following command:
$BT2_HOME/bowtie2-build /home/mstambou/represantative_gut_species/duplidate_genomes.fna duplicate_genomes_indices/duplicate_genomes_db
and then to map the reads to the created database I issued the following command:
$BT2_HOME/bowtie2 --threads 24 -x duplicate_genomes_indices/duplicate_genomes_db -1 /home/team/hutternhower14/info/trimmed_1/gut_DNA/SRR769540-trimmed.1.fq -2 /home/team/hutternhower14/info/trimmed_1/gut_DNA/SRR769540-trimmed.2.fq -S X317802115/mapping_duplicate/SRR769540_duplicate.sam
the output of my last command is the following:
4280027 reads; of these:
4280027 (100.00%) were paired; of these:
4144369 (96.83%) aligned concordantly 0 times
2 (0.00%) aligned concordantly exactly 1 time
135656 (3.17%) aligned concordantly >1 times
----
4144369 pairs aligned concordantly 0 times; of these:
0 (0.00%) aligned discordantly 1 time
----
4144369 pairs aligned 0 times concordantly or discordantly; of these:
8288738 mates make up the pairs; of these:
8251346 (99.55%) aligned 0 times
16 (0.00%) aligned exactly 1 time
37376 (0.45%) aligned >1 times
3.61% overall alignment rate
This output is in agreement of what I expect to see, it's saying non of the reads (other than 2 for some reason) are mapped 1 time, and most of them (the ones that are mapped) are mapped more than 1 time. But when I later check the sam output file I only see each read ID once (i.e. each read only has one row in the sam output). I thought I should see more than one occurrence of the read IDs that are multi mapped. I really don't understand the reason behind this. I am pasting a sample of my sam file just for reference (the rows are sorted alphabetically using the read IDs).
@HD VN:1.0 SO:unsorted
@SQ SN:CP000139.1 LN:5163189
@SQ SN:copy_2 LN:5163189
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.2 CL:"/home/mstambou/bowtie2/bowtie2-2.3.4.2/bowtie2-align-s --wrapper basic-0 --threads 24 -x duplicate_genomes_indices/duplicate_genomes_db -S X317802115/mapping_duplicate/SRR769540_duplicate.sam -1 /home/team/hutternhower14/info/trimmed_1/gut_DNA/SRR769540-trimmed.1.fq -2 /home/team/hutternhower14/info/trimmed_1/gut_DNA/SRR769540-trimmed.2.fq"
SRR769540.1001033.1 89 copy_2 4066024 0 99M = 4066024 0 GTATCACATTTATCACAGTAGGTCTGATGGCAATCGCATTTATGTGTTTCTCTGGTTTGAACATCTAAAAAGAAAGGAATAAGACAATGGATATGAGCA
SRR769540.1001033.2 165 copy_2 4066024 0 * = 4066024 0 TGGCTTACTCTGACGTGCCTGCTCCGTTGAAGGGCTTGGGTATCACATTTATCACAGTAGGTCTGATGGCAATCGCATTTATGTGTTTCTCTGGTTTGAAC B
SRR769540.1001041.1 99 copy_2 2179441 1 101M = 2179495 155 TATCTCCATTAGCGAAAACAGGCGCATGATACATCTTTCCACAGGCCTTTGCATAAGCAATCTGAGTGGACTCTTCATCCAAAGCCTCCCAGCGAGCTTTT :??BDB:2C<<A;EEEAECD??+1???;B4?B>B*BA
SRR769540.1001041.2 147 copy_2 2179495 1 101M = 2179441 -155 AAGCAATCTGAGTGGACTCTTCATCCAAAGCCTCCCAGCGAGCTTTTACACGCATCTCCTGCACTGCTTCATAAGCTAACTTCTGTACGTGAAAGCGATCG CCAC>A>>>A>C
SRR769540.1001043.1 101 copy_2 2179939 0 * = 2179939 0 TAGCGGTTCTGAAGACCAGCTAAGTTCTGGAACACGCATGACATGGACAGATTTCTGGAACGAATGACCGAAAT 7>=AA=7<=B<4,<<7+233+?>=2?=
SRR769540.1001043.2 153 copy_2 2179939 1 101M = 2179939 0 CCAATAAATCCCCGTCAACTCCAGATAAAACACCTAAGAGCTTGCAACTGATCGGATAGCTATCAATATAGTTCTTTTAAAAAAGACGCAAAATCTTGCGT
SRR769540.1001566.1 89 copy_2 4992489 0 7M1I5M1D88M = 4992489 0 GATTTCTCTTTCTCTTTTTAGTTTTCTTAGTCAAAATATGACTATGAAAAGCGTGCTTTCTTTTGATTTTACCTGTTCCGGTAAGGGTAAACCTTTTTTTA DCDCDCDDCC?CDDDDCDDCDDDDCCDDDDDDEECEC
SRR769540.1001566.2 165 copy_2 4992489 0 * = 4992489 0 GTTCCTTAACCTGGCTTAGATTTGTTGTATCAACAGTTGTAGAGTAGCACAGATTTCTCTTTCTCTTTTTAGTTTTCTTAGTCAAAATATGACTATGAAAA
SRR769540.1001567.1 89 copy_2 4992500 0 101M = 4992500 0 CTCTTCTTAGTTTTCTTAGTCAAAATGTGACTGTGAAAAGCGTGCTTTCTTTTGATTTTACCTGTTCCGGTAAGGGTAAACCTTTTTTTAGAACCGGAGTT A?>CCCCACDCCCCACDDDDCDBCC:CAC
SRR769540.1001567.2 165 copy_2 4992500 0 * = 4992500 0 TTTACGCTTTTGATTACTTCATTGCTAAGAGTTCCTTAACCTGGCTTACATTTGTTGTATCAACAGTTGTAGAGTAGCACAGATTTCTTTTTCTCTTCTTA
SRR769540.1001573.1 69 CP000139.1 4992236 0 * = 4992236 0 CGCTGCGTGAAGTAGCGGCGCTAACATTCAAAAAAAGTAAAACTATGCCAAGATCAGTAAATCATGTTGCTTCAAAAGCAAGAAGAAAGAAAATTTTGAAA
SRR769540.1001573.2 137 CP000139.1 4992236 0 84M1D17M = 4992236 0 AAGTAACCTCTGGTTAATTTCAAAATTTTCTTTCTTCTTGCTTTTGAAGCAACATGATTTACTGATCTTGGCATAGTTTTACTTTTTTTGAATGTTAGCGC C
As you can see each read ID is only present once. Am I issuing the wrong command in bowtie? that maybe it's only showing me the primary hits? Im very new to this field and am sorry in advance for the novice question.
thanks.
Thank you for your answer. Yeah right after the post I read more about bowtie2 and found that out. I used -a parameter that reports all the alignments that it finds for the reads. So your saying if I leave it to the default settings then bowtie2 will guarantee to return the best possible hit for that read?
Yes, by default it'll return the best possible hit (with a few caveats, having to do with seed mismatches and the -D and -R options).
I see. Thanks for all the input
How would it determine what the 'best' mapping is if both regions are (as in this toy model) exactly the same? My initial guess is that it would simply report the first mapping if there are two equally good mappings. But in the above example, copy 2 is reported. Maybe after indexing, copy 2 is placed ahead of copy 1? Or maybe bowtie2 does not report the first mapping but the last mapping?
In the case of equally high scoring alignments the result is pseudo-random.