Hi, a quick introduction: a few other bioinformatics students and I are working on an RNA-seq project over the summer to get our hands dirty and gain some experience. The project consists of reproducing a research team's RNA-seq pipeline on another dataset.
We would like to move from tophat2 (used by the team we work for) to hisat2. We are interested in this for two reasons: when multithreading on a computer farm, we get a "[failed]" outcome at the "writing tophat reports" step, and hisat2 is much more efficient, so we would like to learn to use this newer software. Moreover, reproducing the pipeline strategy with different software could further strengthen its "proof of concept".
However, we are having difficulty setting up the parameters.
It would be very kind of you to give us some guidance on how to reproduce these tophat2 parameters in hisat2
(all other parameters left at their defaults):
“--min-intron-length 10 --max-intron-length 20000 --read-mismatches 3 --read-gap-length 2 --read-edit-dist 3 --max-multihits 2 --b2-sensitive --segment-mismatches 2 --segment-length 15 --min-segment-intron 10 --max-segment-intron 20000 --no-coverage-search”.
Another run using:
“--read-gap-length 1000 --read-edit-dist 1003 --b2-ma 3 --b2-rdg 3,1”
I understand that this is a lot to ask, but it is an obstacle at a very early step of the pipeline. The hisat2 parameters are still very cryptic to us. So even if you could just explain some of the underlying concepts so that we can work it out ourselves, that would be very helpful!
Thanks in advance
Take a look at "Simulation-based comprehensive benchmarking of RNA-seq aligners" and see if it helps (indirectly).
As a general rule of thumb for most bioinformatics tools: the default settings should be reasonable for standard situations. Only when your dataset is "different" should you start fiddling around with parameters.
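If you still want to carry over the settings from the question, here is a rough sketch of which of them have a reasonably close hisat2 counterpart. It only maps the options I am fairly sure about: the intron length bounds (`--min-intronlen`, `--max-intronlen`), the cap on reported alignments (`-k`) and the read gap penalties (`--rdg`). The pairing is my own approximation, not an official equivalence table. The mismatch, edit-distance and segment options have no one-to-one flag as far as I know, since hisat2 filters alignments through its scoring scheme (for example `--score-min` and `--mp`), so the sketch just lists them as leftovers. The index name, read file, output file and thread count are hypothetical placeholders.

```python
import shlex

# tophat2 options with a reasonably close hisat2 counterpart.
# This pairing is an approximation, not an official equivalence table.
DIRECT = {
    "--min-intron-length": "--min-intronlen",
    "--max-intron-length": "--max-intronlen",
    "--max-multihits": "-k",   # both cap the number of alignments per read
    "--b2-rdg": "--rdg",       # read gap open,extend penalties
}


def translate(tophat_args):
    """Return (hisat2 options, tophat2 options left untranslated)."""
    tokens = shlex.split(tophat_args)
    hisat, leftovers = [], []
    i = 0
    while i < len(tokens):
        opt = tokens[i]
        # Treat the next token as this option's value unless it is another option.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        if opt in DIRECT and has_value:
            hisat += [DIRECT[opt], tokens[i + 1]]
        else:
            leftovers.append(opt)
        i += 2 if has_value else 1
    return hisat, leftovers


if __name__ == "__main__":
    run1 = (
        "--min-intron-length 10 --max-intron-length 20000 --read-mismatches 3 "
        "--read-gap-length 2 --read-edit-dist 3 --max-multihits 2 --b2-sensitive "
        "--segment-mismatches 2 --segment-length 15 --min-segment-intron 10 "
        "--max-segment-intron 20000 --no-coverage-search"
    )
    hisat_opts, leftovers = translate(run1)
    # genome_index, reads.fastq and aligned.sam are hypothetical placeholders.
    cmd = ["hisat2", "-p", "4", "-x", "genome_index",
           "-U", "reads.fastq", "-S", "aligned.sam"] + hisat_opts
    print(" ".join(cmd))
    print("No direct equivalent assumed for:", ", ".join(leftovers))
```

For the leftover options, I would start from the hisat2 defaults, compare the alignment rates against the tophat2 run, and only then decide whether the scoring options need tuning.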