Mapping RNA-Seq data on genome from another species
2
1
Entering edit mode
7.5 years ago
tlorin ▴ 370

Hello all,

I have RNA-Seq data from a species A for which I don't have a reference genome. I would like to map this data on the closest available genome from species B.

I have read this post and I am planning to use hisat2 or STAR. Does any of you have recommandations regarding the parameters to set for mapping, given that I don't have a reference genome? I would expect that I should use this program with less "stringency" than if I could map on the species A genome. For instance, the default parameter in STAR is --outFilterMismatchNmax 10: should I set it to 20? 30? In hisat2, it is --score-min L,0,-0.2.

My reads are 100nt.

In addition, could you recommend any lecture related to this question?

Many thanks for your answers!

hisat2 RNA-Seq genome mapping star • 5.2k views
ADD COMMENT
2
Entering edit mode

Here you have a very good review about how to do NGS analysis with non-model organisms. It has a part on RNA-seq and how to deal with situations were you do not have a reference genomeNext-generation biology: Sequencing and data analysis approaches for non-model organisms

ADD REPLY
1
Entering edit mode
7.5 years ago
jnoble333 ▴ 20

Hi tlorin,

I use STAR to do this frequently with the parameter --outFilterMismatchNmax 8. You can check the alignment percentage in the Log.final.out file and adjust your mismatch threshold accordingly. You may face the issue of reads aligning to a lot of different locations and can filer your bam file by the mapq score to address this. Be aware that unaligned reads may be placed in the "% of reads unmapped: too short" category even if they are unaligned because of mismatches. Dobin said something about this in one of the threads for the STAR google group.

ADD COMMENT
0
Entering edit mode

Thanks for your answer! But why --outFilterMismatchNmax 8and not --outFilterMismatchNmax 15 or over threshold? Is this some arbitrary threshold that you found was best? I guess this depends on the evolutionary distance between both species too. Is this the post you refer to?

ADD REPLY
0
Entering edit mode

That is the post I went by. That threshold works for my dataset for the most part because I have a lot of reads. I'd say try different values and see how your results differ.

ADD REPLY
1
Entering edit mode
7.5 years ago

BBMap will align RNA-seq data to genomes with a substantially lower identity than most other aligners, particularly other splice-aware aligners. You might add the flags "maxindel=200k minid=0.7" to increase alignment rate in this case.

ADD COMMENT
0
Entering edit mode

Thanks Brian! If I add maxindel=200k minid=0.7 do I risk mapping to spurious locations or will the mapping report the best unique hit (so, the homologous sequence in my case)?

ADD REPLY
1
Entering edit mode

With looser requirements, you can have a greater risk of mapping to spurious locations, but in that case they will have a lower mapq score. BBMap will always map reads to the best-matching location. So, "minid=0.9" and "minid=0.8" and "minid=0.7" will all report the same alignment for a given read, if the best alignment has, say, 95% identity. But if the best alignment only has 85% identity, then "minid=0.9" won't report anything.

ADD REPLY

Login before adding your answer.

Traffic: 2307 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6