Question

Analyzing Unaligned Sequence (From Bowtie)

1

Entering edit mode

12.1 years ago

newDNASeqer ▴ 790

From my RNASeq data, Bowtie aligned 80-90% of the total reads using human genome as the reference, and I am interested in taking a look at the unaligned sequences. Because the RNASeq samples were extracted from humanized mice that has human tumor transplanted, the tissue could be mixed with mice mRNA.

I want to see how much of the unaligned sequences derived from mice and how much of them is from human. Is there any tools that I can use? or I will need to create my own code to read each unaligned sequence and do sequence blast?

bowtie • 5.6k views

ADD COMMENT • link updated 12.1 years ago by swbarnes2 15k • written 12.1 years ago by newDNASeqer ▴ 790

0

Entering edit mode

doesn't bowtie have a command to output unaligned sequences? that should make it easy.

also, are you using bowtie directly or bowtie via tophat?

ADD REPLY • link 12.1 years ago by brentp 24k

0

Entering edit mode

I used bowtied via tophat, and it generated a file named "unmapped.bam" file. Within this unmapped.bam file, I want to get information what percentage of the unaligned reads are human and how many are mice. Does bowtie have this functionality? thanks

ADD REPLY • link 12.1 years ago by newDNASeqer ▴ 790

score 2 · Answer 1 · 2013-07-25

2

Entering edit mode

12.1 years ago

swbarnes2 15k

If you knew the samples were mixed, you could have aligned to a combination of mouse and human reference. That's the best solution; if a read matches mouse perfectly, and human with 2 mismatches, you want it to align to mouse, not to be forced to align to human.

I'd start by doing a simple count of each sequence represented in the unmapped reads. Something like

samtools view unmapped.bam | cut -f 10 | sort | uniq -c | sort -nr > sorted_reads.txt

Will make you a list of each read, and how often it turns up. Obviously it will separate reads that differ from each other by an error, which is not ideal, but this is at least a place to start. If 90% of the reads are the exact same sequence, you want to know that before you start blasting each one individually.

You could also try de novo assembly on the unmapped reads. It won't do much if your reads are scattered across a mammalian genome, but if they are something else, like mouse mitochondrial genes, it could help.

But before you make a program to BLAST each one, just spot-check a few. 80-90% aligned seems alright to me, you might just be looking at noise.

ADD COMMENT • link 12.1 years ago by swbarnes2 15k

0

Entering edit mode

Thank you for the reply. Is the command "samtools view unmapped.bam | cut -f 10 | sort | uniq -c | sort -nr > sorted_reads.txt" outputing all the unmapped reads, only once for each read? I am new to this tool, how do I compare how often each read turns up?

ADD REPLY • link 12.1 years ago by newDNASeqer ▴ 790

0

Entering edit mode

What that command is is samtools at the head, and lots of unix commands daisy-chained afterwards, so that you don't have intermediate files.

samtools view converts the .bam into plain text. That gets piped into cut, which takes the contents of the 10th column, which is the sequence. The first sort sorts all those sequences alphabetically, uniq -c then compresses the list, so that you see each read, and how many times that exact read appears. The second sort sorts by that count number, so at the end, the top of your file contains the most abundant sequences, and how many times those sequences appears in your .bam

As written, that command will count up every read in the file you give it. If your file is a mix of mapped and unmapped reads, using "-f4" option in samtools view will cause samtools to skip the mapped reads.

ADD REPLY • link 12.1 years ago by swbarnes2 15k