Question

Tophat 1.4 Output (Accepted_Hits.Bam)

0

12.5 years ago

thecuriousbiologist ▴ 550

I have read the TopHat 1.4 manual, but I still do not understand the outputs very well.

I aligned my reads to a reference genome with TopHat and got a file called "accepted_hits.bam".

The CIGAR strings in this file don't contain any soft clips (S). Can someone explain why this is happening?

Also, this was for the reference genome. If I wanted to align to a junction database, should I just use the junction reference instead of the genome reference? Or does TopHat align to junctions automatically?

Very confused with the documentation.

tophat output bam • 4.4k views

ADD COMMENT • link updated 10.6 years ago by Biostar 20 • written 12.5 years ago by thecuriousbiologist ▴ 550

score 1 · Answer 1 · 2012-10-08

1

12.5 years ago

Ido Tamir 5.2k

TopHat splits the reads, but still requires every base of the read to be aligned. You could have mismatches at the beginning or end, but these would be scored as M in the CIGAR string rather than soft-clipped. You have to look in the MD tag for mismatches. Old TopHat versions don't write the optional MD tag, but you can generate it with samtools calmd.
You always need a reference sequence. You can use a junction file or a GTF file in addition to it. TopHat automatically tries to align reads that cannot be mapped completely by aligning them in a split fashion, thus discovering junctions. You can turn this off, modify it with parameters, or supply your own junctions. But you should also have a junctions.bed file in your output and see some reads with a CIGAR like 20M500N50M.
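
To check this on your own reads, here is a minimal Python sketch that tallies CIGAR operations and counts mismatches from an MD value. It is only an illustration: the function names are made up for this post, the MD string in the usage note is an invented example, and it assumes you have already pulled the CIGAR and MD fields out of the SAM text (for instance from the output of samtools view).

```python
import re
from collections import Counter

# CIGAR operations defined by the SAM format
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR: {cigar!r}")
    return ops


def op_totals(cigar):
    """Total number of bases per CIGAR operation."""
    totals = Counter()
    for length, op in parse_cigar(cigar):
        totals[op] += length
    return totals


def md_mismatches(md):
    """Count mismatched bases in an MD tag value (the part after 'MD:Z:')."""
    # Deleted reference bases are written as '^' plus letters; drop them
    # so that only the mismatch letters remain.
    without_deletions = re.sub(r"\^[A-Za-z]+", "", md)
    return sum(ch.isalpha() for ch in without_deletions)
```

For the example read above, op_totals("20M500N50M") gives 70 bases of M, 500 of N, and nothing for S. An MD value such as 10A5^AC6 contains one mismatch (the A) and one two-base deletion, so md_mismatches returns 1 for it.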

ADD COMMENT • link 12.5 years ago by Ido Tamir 5.2k

0

Can the 500N part of the CIGAR contain soft clips?

ADD REPLY • link 12.5 years ago by thecuriousbiologist ▴ 550

0

500N is a gap of 500 reference bases that TopHat uses to mark an intron. It covers no read bases, so there is nothing in it to soft-clip.
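
A rough Python sketch of how the gap falls out of the CIGAR, assuming a 1-based leftmost position as in the SAM POS field. The function name and the start position 1000 are made up for illustration, and deletions (D) are kept inside a block here, which is a choice of this sketch.

```python
import re


def aligned_blocks(pos, cigar):
    """Return 1-based inclusive (start, end) reference blocks of a read.

    pos is the leftmost mapped position (SAM POS field).
    Blocks are separated wherever the CIGAR has an N (skipped region).
    """
    blocks = []
    ref = pos            # next reference position to be consumed
    block_start = None   # start of the block currently being extended
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=XD":
            # These operations consume reference bases inside a block.
            if block_start is None:
                block_start = ref
            ref += length
        elif op == "N":
            # Skipped reference region: close the current block and jump.
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            ref += length
        # I, S, H and P do not consume reference bases.
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks
```

With a hypothetical start of 1000, aligned_blocks(1000, "20M500N50M") returns [(1000, 1019), (1520, 1569)]: two exon pieces with positions 1020 to 1519 skipped in between.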

ADD REPLY • link 11.0 years ago by karl.stamm 4.1k