Question

Detecting novel spliceforms from RNA-seq data: will STAR > RSEM work, or do I need to use something else?

4.8 years ago

Jen ▴ 30

I am trying to determine the best RNA-seq analysis pipeline for identifying novel spliceforms (I also care about non-coding RNAs). I have an enormous RNA-seq dataset that I have already analyzed using STAR to map and RSEM to quantify. My data are in .fastq format, from mouse, generated from a stranded, ribodepleted library, with 60M PE reads per sample and 100 bp reads. I thought I read that RSEM is not able to detect novel spliceforms (am I wrong?). The pipeline I think would work for this is HISAT2 > StringTie > Ballgown. My questions are: (1) Can I use my current pipeline (STAR > RSEM) to identify novel spliceforms using special run parameters, or do I need to redo the analysis with a different pipeline? (2) If a different pipeline would be better for this, which one would people recommend, and what options would you use for mapping and quantification?

My boss and I never discussed wanting to identify novel spliceforms, and now he has a grant due and wants this data ASAP, so I'm in a time crunch! Also, I've only been doing bioinformatics for two years and am self-taught, so I apologize if anything doesn't make sense. Please ask for clarification if needed. Your advice is greatly appreciated.

RNA-Seq • 1.8k views

ADD COMMENT • link updated 4.8 years ago by Antonio R. Franco ★ 5.2k • written 4.8 years ago by Jen ▴ 30

If you already have your Illumina reads and want to map them to the reference genome, you need a splice-aware mapper such as HISAT2 or STAR to resolve the junctions. You are comparing reads that come from mature RNA, which has no introns, against a reference genome that does have them.

ADD REPLY • link 4.8 years ago by Antonio R. Franco ★ 5.2k

So right now I have STAR output, which was generated by mapping my reads to GRCm38.dna.primary_assembly.fa with this command:

STAR --genomeDir star --readFilesIn Sample1_Forward.fq Sample1_Reverse.fq --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM --runThreadN 16 --outFileNamePrefix "Sample1_star/"

The output files are:
Aligned.sortedByCoord.out.bam
Aligned.toTranscriptome.out.bam
Log.out
Log.progress.out
Log.final.out
SJ.out.tab
_STARgenome
_STARpass1
_STARtmp

Then normally I would run RSEM on the STAR output file Aligned.toTranscriptome.out.bam:

RSEM-1.3.1/rsem-calculate-expression --bam -p 16 \
    --paired-end --forward-prob .5 \
    Sample1_star/Aligned.toTranscriptome.out.bam \
    rsem/GRCm38 Sample1_rsem/rsem >& \
    Sample1_rsem/rsem.log

The output files for RSEM are:
rsem.genes.results
rsem.isoforms.results
rsem.log
rsem.stat
rsem.transcript.bam

I've been using the rsem.genes.results and rsem.isoforms.results files so far for analysis. I was assuming that these files only contain known isoforms. Do my RSEM results already contain information on novel spliceforms that I just don't know how to access? Sorry if any of this is obvious. Also, sorry for the formatting of this reply. I'm still getting used to it.

ADD REPLY • link 4.8 years ago by Jen ▴ 30
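
On the question above of whether the existing STAR output already holds novel-splicing information: RSEM only quantifies the transcripts in its reference, but STAR's SJ.out.tab lists every junction it detected. A minimal sketch for pulling out unannotated junctions follows, assuming the standard SJ.out.tab layout (column 6 is 0 for junctions not in the annotation, column 7 is the uniquely mapped read count). The function name novel_junctions and the default cutoff of 5 reads are hypothetical choices for illustration. One caveat, as I understand the STAR manual: with --twopassMode Basic, junctions found in the first pass are also reported as annotated in the final file, so the first-pass file (_STARpass1/SJ.out.tab) may be the better input.

```shell
# Hypothetical helper (not part of STAR): print junctions flagged as unannotated.
# Assumed SJ.out.tab columns: 1 chrom, 2 intron start, 3 intron end, 4 strand,
# 5 intron motif, 6 annotated (0 = not in the GTF), 7 uniquely mapped reads,
# 8 multi-mapped reads, 9 maximum overhang.
novel_junctions() {
    sj_file="$1"
    min_reads="${2:-5}"   # illustrative default, not a recommended threshold
    # Keep rows that are unannotated and supported by enough unique reads.
    awk -F'\t' -v min="$min_reads" '$6 == 0 && $7 >= min' "$sj_file"
}
```

It could be called as, for example, `novel_junctions Sample1_star/_STARpass1/SJ.out.tab 10 > Sample1_novel_junctions.tab`. Note that this yields novel junctions only, not assembled full-length isoforms; for those, a transcript assembler such as StringTie would still have to be run on the coordinate-sorted BAM.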

score 0 · Answer 1 · 2020-12-17

I would not have said this a year ago...

I would run Iso-Seq sequencing with PacBio HiFi to answer this question.

Why?

Prices for PacBio sequencing have dropped dramatically. Nowadays the cost can be pretty similar to that of Illumina.
Iso-Seq involves truly sequencing your RNA population, so it does not require the statistical inference that in many cases leads to false results. I mean that you end up sequencing the whole mRNA, from beginning to end. You obtain the actual sequence of your RNA, and with HiFi reads the quality values exceed those of Illumina.

That way you no longer have to rely on mapping your reads.