Question

Mapping RNA-Seq data on genome from another species

1

7.5 years ago

tlorin ▴ 370

Hello all,

I have RNA-Seq data from a species A for which I don't have a reference genome. I would like to map these data to the closest available genome, from species B.

I have read this post and I am planning to use hisat2 or STAR. Do any of you have recommendations for the mapping parameters, given that I don't have a reference genome for species A? I expect that I should run these programs with less "stringency" than if I could map to the species A genome. For instance, the default in STAR is --outFilterMismatchNmax 10: should I set it to 20? 30? In hisat2, the default is --score-min L,0,-0.2.

My reads are 100nt.
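To make the hisat2 default concrete, here is a rough sketch of what --score-min L,0,-0.2 implies for 100 nt reads. It assumes the linear form documented for hisat2 (minimum score = intercept + slope × read length) and the default maximum mismatch penalty of 6 (--mp 6,2). The function names are made up for illustration, and the calculation ignores soft clipping, gaps, and quality-dependent penalties, so treat the result as an upper bound on mismatches rather than an exact rule.

```python
def hisat2_min_score(read_length, intercept=0.0, slope=-0.2):
    """Minimum alignment score for a linear --score-min L,intercept,slope."""
    return intercept + slope * read_length


def max_full_penalty_mismatches(read_length, intercept=0.0, slope=-0.2,
                                mismatch_penalty=6):
    """Largest number of full-penalty mismatches that still meets the threshold.

    Assumption: every mismatch costs `mismatch_penalty` and nothing else
    (no gaps, no soft clipping) contributes to the score.
    """
    # Rounding guards against floating-point noise in slope * read_length.
    budget = round(-hisat2_min_score(read_length, intercept, slope), 9)
    return int(budget // mismatch_penalty)


if __name__ == "__main__":
    for slope in (-0.2, -0.4):
        print(slope,
              hisat2_min_score(100, slope=slope),
              max_full_penalty_mismatches(100, slope=slope))
```

Under these assumptions the default tolerates about three mismatches in a 100 nt read, so a more negative slope (for example L,0,-0.4) is one way to relax hisat2 for a divergent reference.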

In addition, could you recommend any reading on this question?

Many thanks for your answers!

hisat2 RNA-Seq genome mapping star • 5.2k views

Updated 7.5 years ago by Brian Bushnell 20k • written 7.5 years ago by tlorin ▴ 370

2

Here is a very good review of how to do NGS analysis with non-model organisms. It has a section on RNA-seq and on how to deal with situations where you do not have a reference genome: "Next-generation biology: Sequencing and data analysis approaches for non-model organisms".

7.5 years ago by IP ▴ 770

score 1 · Answer 1 · 2017-05-18

1

7.5 years ago

jnoble333 ▴ 20

Hi tlorin,

I do this frequently with STAR, using the parameter --outFilterMismatchNmax 8. You can check the alignment percentage in the Log.final.out file and adjust your mismatch threshold accordingly. You may find that reads align to many different locations; you can filter your BAM file by MAPQ score to address this. Be aware that unaligned reads may be placed in the "% of reads unmapped: too short" category even if they are unaligned because of mismatches. Dobin said something about this in one of the threads on the STAR Google group.
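As a rough illustration of the two checks described here, the sketch below reads the summary metrics out of a Log.final.out and filters SAM-format text by MAPQ. It assumes the usual Log.final.out layout of "metric name | value" lines and the standard SAM convention that MAPQ is the fifth tab-separated column. The sample values are invented for the example, and the helper names are hypothetical.

```python
# Invented excerpt in the style of a STAR Log.final.out (values are made up).
SAMPLE_LOG = (
    "                          Number of input reads |\t1000000\n"
    "                                    UNIQUE READS:\n"
    "                        Uniquely mapped reads % |\t61.20%\n"
    "             % of reads mapped to multiple loci |\t9.80%\n"
    "       % of reads unmapped: too many mismatches |\t0.00%\n"
    "                 % of reads unmapped: too short |\t28.10%\n"
)


def parse_star_log(text):
    """Return {metric name: value string} for every 'name | value' line."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" have no separator
        key, _, value = line.partition("|")
        metrics[key.strip()] = value.strip()
    return metrics


def percent(metrics, key):
    """Convert a value such as '61.20%' to the float 61.2."""
    return float(metrics[key].rstrip("%"))


def filter_sam_by_mapq(lines, min_mapq):
    """Yield header lines unchanged and alignments with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:  # column 5 of a SAM record is MAPQ
            yield line


if __name__ == "__main__":
    m = parse_star_log(SAMPLE_LOG)
    print(percent(m, "Uniquely mapped reads %"),
          percent(m, "% of reads unmapped: too short"))
```

With STAR's default settings, uniquely mapped reads get MAPQ 255, so a threshold of 255 keeps only unique alignments. For real BAM files, a tool such as samtools view with its -q option does the same job without converting to text.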

7.5 years ago by jnoble333 ▴ 20

0

Thanks for your answer! But why --outFilterMismatchNmax 8 and not --outFilterMismatchNmax 15 or some other threshold? Is this an arbitrary threshold that you found worked best? I guess this also depends on the evolutionary distance between the two species. Is this the post you refer to?

7.5 years ago by tlorin ▴ 370

0

That is the post I went by. That threshold works for my dataset for the most part because I have a lot of reads. I'd say try different values and see how your results differ.
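One way to "try different values" systematically is to generate one STAR command per threshold and compare the resulting Log.final.out files. The sketch below only builds the argument lists; it does not run STAR. The genome directory, read file names, thread count, and output prefix scheme are placeholders chosen for illustration, while the options themselves (--runThreadN, --genomeDir, --readFilesIn, --outFilterMismatchNmax, --outFileNamePrefix) are standard STAR options.

```python
def star_command(genome_dir, reads, mismatch_max, prefix, threads=4):
    """Build the argument list for one STAR run (nothing is executed)."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,
        "--outFilterMismatchNmax", str(mismatch_max),
        "--outFileNamePrefix", prefix,
    ]


def sweep_commands(genome_dir, reads, thresholds, out_root="sweep"):
    """Return {threshold: command}, each run writing under its own prefix."""
    return {
        n: star_command(genome_dir, reads, n, f"{out_root}/mm{n}_")
        for n in thresholds
    }


if __name__ == "__main__":
    # Placeholder paths: substitute your own species B index and read files.
    cmds = sweep_commands("speciesB_index", ["A_R1.fastq", "A_R2.fastq"],
                          [8, 10, 15, 20])
    for n, cmd in cmds.items():
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run once the output directory exists. Comparing the unique and multi-mapping percentages across runs gives a feel for where loosening the threshold stops helping.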

7.5 years ago by jnoble333 ▴ 20

score 1 · Answer 2 · 2017-05-18

1

7.5 years ago

Brian Bushnell 20k

BBMap will align RNA-seq data to genomes with a substantially lower identity than most other aligners, particularly other splice-aware aligners. You might add the flags "maxindel=200k minid=0.7" to increase alignment rate in this case.

7.5 years ago by Brian Bushnell 20k

0

Thanks Brian! If I add maxindel=200k minid=0.7 do I risk mapping to spurious locations or will the mapping report the best unique hit (so, the homologous sequence in my case)?

7.5 years ago by tlorin ▴ 370

1

With looser requirements, you run a greater risk of mapping to spurious locations, but in that case the alignments will have a lower MAPQ score. BBMap will always map reads to the best-matching location. So "minid=0.9", "minid=0.8", and "minid=0.7" will all report the same alignment for a given read if the best alignment has, say, 95% identity. But if the best alignment has only 85% identity, then "minid=0.9" won't report anything.
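The behaviour described above can be pictured with a toy model in which minid acts purely as a cutoff on the single best alignment. This is a simplification for illustration, not BBMap's actual code: the function name is made up, and BBMap's documentation describes minid as an approximate threshold, so behaviour right at the boundary may differ.

```python
def reported_identity(best_identity, minid):
    """Toy model: report the best alignment only if it reaches minid.

    Returns the identity of the reported alignment, or None when the
    best alignment falls below the cutoff and nothing is reported.
    """
    return best_identity if best_identity >= minid else None


if __name__ == "__main__":
    for best in (0.95, 0.85):
        for minid in (0.9, 0.8, 0.7):
            print(best, minid, reported_identity(best, minid))
```

In this model, lowering minid never changes which location a read maps to. It only lets reads whose best hit is weaker appear in the output, which is why filtering on MAPQ afterwards remains useful.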

7.5 years ago by Brian Bushnell 20k