I have read the Tophat 1.4 manual, however I do not seem to understand the outputs very well.
I used a reference genome for my alignment with Tophat and I got a file called "accepted_hits.bam"
This file's cigar strings don't contain any soft clips (S). Can someone explain me why this is happening ?
Also, this was for the reference genome. If I wanted to align to a junction database, should I just use the junction reference instead of the genome reference ? Or does Tophat align to junctions automatically ?
tophat splits the reads, but still requires everything to match. You
could have mismatches in the beginning or end, but these would be
scored as M in the CIGAR string. You have to look in the MD tag for mismatches. Old tophat versions dont have the optional MD tag. You can generate it with
samtools calmd.
You always need a reference sequence. You can use a junction file or a GTF file in addition to this. Tophat automatically tries to align reads that can not be mapped completely by aligning them in a split fashion, thus discovering junctions. You can turn this off, modify it with parameters or supply your own junctions. But you should also have a junctions.bed file in your output and see some reads with a CIGAR like 20M500N50M.
Can the 500N part of the CIGAR contain soft clips ?
500N is a gap tophat is marking an intron