Question

Tophat multiple or unique mapping criteria

7.3 years ago

maple964 • 0

Dear all,

I am very confused about mapping RNA-seq reads to the Arabidopsis genome.

Basically, I want to know what to consider when setting a threshold on mapping quality. I have followed several sources of information like this one: http://biofinysics.blogspot.tw/2014/05/how-does-bowtie2-assign-mapq-scores.html.

What to consider while choosing multiple mapping or unique mapping criteria?

However, I failed to find a suitable option in TopHat.

Is it possible to set a threshold of 95% alignment identity?
What do the --very-fast and --very-sensitive modes mean in the tophat or bowtie2 manual?

Thank you all in advance


score 4 · Accepted Answer · 2017-09-15

TopHat uses bowtie2 as its mapping engine. From the tophat2 help:

  Scoring options
    --b2-mp                        <int>,<int> [ default: 6,2              ]
    --b2-np                        <int>       [ default: 1                ]
    --b2-rdg                       <int>,<int> [ default: 5,3              ]
    --b2-rfg                       <int>,<int> [ default: 5,3              ]
    --b2-score-min                 <func>      [ default: L,-0.6,-0.6      ]

What to consider while choosing multiple mapping or unique mapping criteria?

Remember that these algorithms won't always find the best mapping position on the first try. The -k parameter of bowtie2 sets how many distinct valid alignments the program searches for per read before it stops (TopHat's counterpart is -g/--max-multihits), and allowing only one does not guarantee that it is the best one. If the mapping seed fits perfectly in one spot but the surrounding sequence does not match well, with -k 1 you will retrieve only that alignment. Perhaps there was a mapping position somewhere else, with 1 mismatch in the seed, that extended into a better alignment, but you never saw it because you asked for only one alignment per read.

Another thing to point out: you probably want to ask the algorithm for, say, up to 5 alignments per read, and then select the primary alignment from the output file by excluding the secondary ones (samtools view -F 0x0100). Among 5 candidate alignments the best one is most likely present, and it will be flagged as "primary".
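To make the flag filter concrete, here is a minimal Python sketch of what excluding 0x0100 amounts to: a record is dropped when the "not primary alignment" bit is set in its FLAG column. The read names, chromosomes and positions are made up for illustration, and unlike a plain samtools view the sketch passes header lines through.

```python
SECONDARY = 0x100  # SAM FLAG bit meaning "not primary alignment"


def keep_primary(sam_lines):
    """Yield header lines and records whose FLAG lacks the secondary bit."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header line: pass through unchanged
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not (flag & SECONDARY):
            yield line


# Toy records (hypothetical reads, for illustration only)
records = [
    "read1\t0\tChr1\t100\t50\t91M\t*\t0\t0\t*\t*",    # primary
    "read1\t256\tChr2\t500\t1\t91M\t*\t0\t0\t*\t*",   # secondary (0x100 set)
    "read2\t16\tChr3\t42\t50\t91M\t*\t0\t0\t*\t*",    # primary, reverse strand
]

primary = list(keep_primary(records))
print(len(primary))  # 2
```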

Is it possible to set a threshold of 95% alignment identity?

The --b2-score-min parameter sets the minimum alignment score a read must reach for its mapping to be kept. The default, L,-0.6,-0.6, is a linear function of the read length, and both numbers can be changed: the read length is multiplied by the third value, and then the second value is added. If your reads are 91 nt long, the minimum score is ( 91 x -0.6 ) + ( -0.6 ) = -55.2
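As a quick check of that arithmetic, the function string can be evaluated in a few lines of Python. This is only a sketch: min_score is a hypothetical helper, not part of TopHat or bowtie2, and it assumes the F,B,A convention of the bowtie2 manual (constant term B, coefficient A) for the linear, square-root and natural-log types.

```python
import math


def min_score(func, read_len):
    """Evaluate a bowtie2-style function string such as 'L,-0.6,-0.6'.

    Format assumed: TYPE,CONSTANT,COEFFICIENT, i.e.
    f(x) = CONSTANT + COEFFICIENT * g(x), where g depends on TYPE.
    """
    kind, const, coef = func.split(",")
    shapes = {
        "L": lambda x: x,   # linear in read length
        "S": math.sqrt,     # square root of read length
        "G": math.log,      # natural log of read length
    }
    if kind not in shapes:
        raise ValueError(f"unsupported function type: {kind}")
    return float(const) + float(coef) * shapes[kind](read_len)


print(round(min_score("L,-0.6,-0.6", 91), 1))  # -55.2
```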

--b2-mp defines the mismatch penalty: the first number applies when the quality of the base is good, the second when it is bad. If you trimmed the reads beforehand, you might as well use --ignore-quals so that the program only uses the first value (6), which makes the arithmetic easier. In this case you can have at most 55.2 / 6 = 9.2 ~ 9 mismatches in one read.

--b2-rdg and --b2-rfg define the read and reference gap opening and extension penalties (5 to open a gap, 3 for each gap position). In this case you can have a gap which is at most ( 55.2 - 5 ) / 3 = 16.73 ~ 16 nt long.

So all in all you can have either 9 mismatches, or a 16 nt gap, or a combination of the two. With reads that are 91 nt long, 9 mismatches represent ~10% of the read length, which corresponds to ~90% sequence identity.
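The whole budget calculation can be reproduced with a short Python sketch. The function names are hypothetical, the defaults are the ones from the tophat2 help shown earlier, and every mismatch is assumed to cost the full penalty of 6 (the --ignore-quals case).

```python
import math


def penalty_budget(read_len, const=-0.6, coef=-0.6):
    """Total penalty a read can accumulate before dropping below the
    minimum score (defaults taken from --b2-score-min L,-0.6,-0.6)."""
    return -(const + coef * read_len)


def max_mismatches(budget, mismatch_penalty=6):
    """Mismatches that fit in the budget, each costing the full penalty."""
    return math.floor(budget / mismatch_penalty)


def max_gap_length(budget, gap_open=5, gap_extend=3):
    """Longest single gap: one opening cost plus one extension per position."""
    return max(0, math.floor((budget - gap_open) / gap_extend))


read_len = 91
budget = penalty_budget(read_len)
mismatches = max_mismatches(budget)
gap = max_gap_length(budget)
identity = 1 - mismatches / read_len

print(mismatches)          # 9
print(gap)                 # 16
print(f"{identity:.1%}")   # 90.1%
```

Changing read_len shows how the same default threshold becomes more or less permissive for shorter or longer reads.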

You can't set a particular percentage of sequence identity directly (that is easily done in BLAT if you need it), but these scoring options let you refine your mapping strategy in much finer detail.
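That said, the arithmetic above can be inverted to get a rough approximation of an identity cutoff. The Python sketch below is my own illustration, not a TopHat feature: score_min_for_identity is a hypothetical helper, it counts mismatches only, and it assumes every mismatch costs the full --b2-mp penalty of 6. Gaps and N positions are penalised differently, so the resulting string is a starting point rather than an exact 95% filter.

```python
import math


def score_min_for_identity(identity, mismatch_penalty=6):
    """Build a linear score-min string that allows roughly
    (1 - identity) mismatches per base (assumption: mismatches only)."""
    per_base = -(1.0 - identity) * mismatch_penalty
    return f"L,0,{per_base:.2f}"


def allowed_mismatches(func, read_len, mismatch_penalty=6):
    """Mismatches permitted by a linear 'L,const,coef' string."""
    _, const, coef = func.split(",")
    budget = -(float(const) + float(coef) * read_len)
    return math.floor(budget / mismatch_penalty)


func = score_min_for_identity(0.95)
print(func)                          # L,0,-0.30
print(allowed_mismatches(func, 91))  # 4
```

For 91 nt reads this allows 4 mismatches, i.e. about 95.6% identity, and the string would be passed as the value of --b2-score-min.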

What do the --very-fast and --very-sensitive modes mean in the tophat or bowtie2 manual?

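If you have a closer look at the bowtie2 manual, these modes are presets that bundle the settings controlling how much effort the search spends on each read (the seed length -L, the seed interval -i, the number of seed-extension attempts -D and the number of re-seeding rounds -R), so that the alignment becomes faster (i.e. not trying "hard" to map a read) or more sensitive (i.e. trying "hard"). They do not change the scoring options described above.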
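For reference, the bowtie2 manual lists each end-to-end preset as shorthand for a fixed set of effort options. The Python sketch below records those values as I recall them from the manual (please verify them against your installed version) and evaluates the seed-interval function -i for a 91 nt read, to show how the presets trade speed for sensitivity. The seed_interval helper is hypothetical.

```python
import math

# Preset -> effort options, as recalled from the bowtie2 manual
# (end-to-end mode); check against your own version before relying on them.
PRESETS = {
    "--very-fast":      {"D": 5,  "R": 1, "N": 0, "L": 22, "i": ("S", 0, 2.50)},
    "--fast":           {"D": 10, "R": 2, "N": 0, "L": 22, "i": ("S", 0, 2.50)},
    "--sensitive":      {"D": 15, "R": 2, "N": 0, "L": 22, "i": ("S", 1, 1.15)},
    "--very-sensitive": {"D": 20, "R": 3, "N": 0, "L": 20, "i": ("S", 1, 0.50)},
}


def seed_interval(func, read_len):
    """Evaluate a square-root interval function (type, constant, coefficient).
    Results below 1 are raised to 1, as the manual describes."""
    kind, const, coef = func
    if kind != "S":
        raise ValueError("this sketch only handles square-root functions")
    return max(1.0, const + coef * math.sqrt(read_len))


intervals = {name: seed_interval(opts["i"], 91) for name, opts in PRESETS.items()}

for name, value in intervals.items():
    opts = PRESETS[name]
    print(f"{name:17s} D={opts['D']:2d} R={opts['R']} L={opts['L']} "
          f"seed every ~{value:.1f} nt")
```

A smaller interval means more seeds are extracted per read, and larger -D and -R values mean more extension attempts before giving up, which is why --very-sensitive is slower but finds more alignments.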