Question

Bowtie Outputs Hits That Are Shorter Than The Reads Aligned

1

Entering edit mode

11.5 years ago

Click downvote ▴ 720

I have a file that looks like

>rno-miR-322 MIMAT0001619 Rattus norvegicus miR-322

CAGCAGCAAUUCAUGUUUUGGA

>rno-miR-322* MIMAT0000547 Rattus norvegicus miR-322*

AAACAUGAAGCGCUGCAACA

The reads are 22 and 20 nucleotides in length, respectively.

Aligning this file to the rn4 genome with the command

bowtie -f rn4 test_file

produces the output:

rno-miR-322 MIMAT0001619 Rattus norvegicus miR-322 + chrX 22228104 CAGCAGCAACAGGGA IIIIIIIIIIIIIII 11
rno-miR-322* MIMAT0000547 Rattus norvegicus miR-322* - chr8 27772395 TGTTGCGCGCTTCTGTTT IIIIIIIIIIIIIIIIII 0 12:T>C,16:T>G

Observe that the length of the reads aligned are too short, 15 and 18 characters. Strange thing is, when I try to align the reads individually with the -c option it works:

bowtie -f rn4 -c CAGCAGCAAUUCAUGUUUUGGA

produces

0 - chrX 140000211 TCCAAAACATGAATTGCTGCTG IIIIIIIIIIIIIIIIIIIIII 0

And

bowtie -f rn4 -c AAACAUGAAGCGCUGCAACA

gives

0 - chrX 140000175 TGTTGCAGCGCTTCATGTTT IIIIIIIIIIIIIIIIIIII 0

What might possibly be wrong?

bowtie • 2.6k views

ADD COMMENT • link updated 9.9 years ago by Biostar 20 • written 11.5 years ago by Click downvote ▴ 720

2

Entering edit mode

Have you noticed that it removes only the Us? I suspect that specifying a uridine isn't supported in an input fasta file. I could check the source code, but if you just check a few more reads, you could more quickly confirm that.

ADD REPLY • link 11.5 years ago by Devon Ryan 105k

0

Entering edit mode

That's one of the first things I wondered about but the person I'm helping said that most such software supported "U"s. Should have checked it anyways. Edit: Yes, this was it. Please post as answer and I'll accept and upvote.

ADD REPLY • link 11.5 years ago by Click downvote ▴ 720

1

Entering edit mode

I'm not sure where the person you're working with came by that idea. bowtie and other similar tools are meant for high-throughput sequencing data, which should rarely, if ever, have a U in the sequence (I've never seen one at least). Is the person by chance a bench scientist? :)

ADD REPLY • link 11.5 years ago by Devon Ryan 105k

1

Entering edit mode

Yes, it is I that should have known better. Thanks!

ADD REPLY • link 11.5 years ago by Click downvote ▴ 720

0

Entering edit mode

BBMap supports U in the reference :)

It supports U in reads, also, if you add the "utot" flag (U to T).

ADD REPLY • link 9.9 years ago by Brian Bushnell 20k

score 0 · Answer 1 · 2013-11-11

Here is a python script that evades the problem by using the -c option. If anyone knows what or just have some guesses about what is happening in the question above please do help.

from __future__ import print_function
import sys
import subprocess

list_of_reads_to_align = []

with open(sys.argv[1]) as input_file:
    for line in input_file:
        if not line.startswith(">"):
            list_of_reads_to_align.append(line.strip())

args_noheader = 'bowtie -f --sam-nohead --best --strata -m 10 -S rn4 -c {0}'
args = 'bowtie -f --best --strata -m 10 -S rn4 -c {0}'

for counter, read in enumerate(list_of_reads_to_align):
    if counter != 0:
        cmd = subprocess.Popen(args_noheader.format(read), shell = True, stdout = subprocess.PIPE, close_fds=True, stderr = subprocess.PIPE)
    else:
        cmd = subprocess.Popen(args.format(read), shell = True, stdout = subprocess.PIPE, close_fds=True, stderr = subprocess.PIPE)

    for line in cmd.stdout:
        print(line, end="")