I am trying to determine the best RNAseq analysis pipeline to use to identify novel spliceforms (I also care about non-coding RNAs). I have an enormous RNAseq dataset which I have already analyzed using STAR to map and RSEM to quantify. My data is in .fastq format, generated from stranded, ribodepleted libraries, with 60M paired-end 100 bp reads per sample, from mouse. I thought I read that RSEM is not able to detect novel spliceforms (am I wrong?). The pipeline I am thinking would work for this is HISAT2 > StringTie > Ballgown. My questions are: (1) Can I use my current pipeline (STAR > RSEM) to identify novel spliceforms using special run parameters, or do I need to redo the analysis with a different pipeline? (2) If a different pipeline would be better for this, which pipeline would people recommend, and what options would you use for mapping and quantification?
My boss and I never discussed wanting to identify novel spliceforms, and now he has a grant due and wants this data ASAP, so I'm on a time crunch! Also, I've only been doing bioinformatics for two years and have taught myself, so I apologize if anything doesn't make sense. Please ask for clarification if needed. Your advice is greatly appreciated.
If you already have your Illumina reads and want to map them to the reference genome, you need a splice-aware mapper such as HISAT2 or STAR to resolve the junctions. You are comparing reads that come from mature RNA, which has no introns, against a reference genome that does have them.
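One way to see this in alignments you already have: a splice-aware mapper records each intron a read skips over as an N operation in the CIGAR string (column 6 of a SAM record). The snippet below is only a rough sketch for counting such reads. The function name count_spliced is made up for illustration, and it assumes SAM text arriving on standard input.

```shell
# Count aligned records whose CIGAR contains an N operation (a skipped intron).
# count_spliced is a hypothetical helper name; it expects SAM records on stdin.
count_spliced() {
    awk -F '\t' '
        /^@/      { next }          # skip SAM header lines, if present
        $6 == "*" { next }          # skip unmapped records (no CIGAR)
        $6 ~ /N/  { spliced++ }     # column 6 is the CIGAR string
                  { total++ }
        END { printf "%d of %d alignments are spliced\n", spliced, total }
    '
}

# Usage (assumes samtools is installed):
# samtools view Sample1_star/Aligned.sortedByCoord.out.bam | count_spliced
```

A read that lies entirely inside one exon has a CIGAR like 100M and is not counted as spliced; only reads spanning a junction carry the N.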
So right now, I have STAR output which was generated by mapping my reads to GRCm38.dna.primary_assembly.fa with this command:
STAR --genomeDir star --readFilesIn Sample1_Forward.fq Sample1_Reverse.fq --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 16000000000 --outSAMunmapped Within --twopassMode Basic --outFilterMultimapNmax 1 --quantMode TranscriptomeSAM --runThreadN 16 --outFileNamePrefix "Sample1_star/"
The output files are:
Aligned.sortedByCoord.out.bam
Aligned.toTranscriptome.out.bam
Log.out
Log.progress.out
Log.final.out
SJ.out.tab
_STARgenome
_STARpass1
_STARtmp
Then normally I would run RSEM on the STAR output file Aligned.toTranscriptome.out.bam:
RSEM-1.3.1/rsem-calculate-expression --bam -p 16 \
    --paired-end --forward-prob .5 \
    Sample1_star/Aligned.toTranscriptome.out.bam \
    rsem/GRCm38 Sample1_rsem/rsem >& \
    Sample1_rsem/rsem.log
The output files for RSEM are:
rsem.genes.results
rsem.isoforms.results
rsem.transcript.bam
rsem.stat
rsem.log
I've been using the rsem.genes.results and rsem.isoforms.results files so far for analysis. I was assuming that these files only contain known isoforms. Do my RSEM results already contain information on novel spliceforms that I just don't know how to access? Sorry if any of this is obvious, and sorry for the formatting of this reply; I'm still getting used to it.
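On what is already on disk: RSEM quantifies only the transcripts present in the reference it was given, so the place to look for unannotated splicing in the existing output is STAR's SJ.out.tab. In that file column 6 flags whether a junction is annotated (1) or not (0), and column 7 gives the number of uniquely mapped reads crossing it. Below is a minimal sketch for pulling out unannotated junctions. The function name novel_junctions and the default cutoff of 5 reads are illustrative assumptions, not recommendations. One caveat, if I read the STAR manual correctly: with --twopassMode Basic, junctions discovered in the first pass are reported as annotated in the final SJ.out.tab, so the first-pass table under _STARpass1 may be the more informative input.

```shell
# List unannotated splice junctions from a STAR SJ.out.tab file.
# novel_junctions is a hypothetical helper; min_reads (default 5) is an arbitrary cutoff.
# Columns used: 1 chromosome, 2 intron start, 3 intron end,
#               6 annotated flag (0 = unannotated), 7 uniquely mapped read count.
novel_junctions() {
    sj_file=$1
    min_reads=${2:-5}
    awk -F '\t' -v min="$min_reads" \
        '$6 == 0 && $7 >= min { print $1 ":" $2 "-" $3 "\t" $7 }' "$sj_file"
}

# Usage (path assumes the output prefix from the STAR command above):
# novel_junctions Sample1_star/_STARpass1/SJ.out.tab 10
```

This only reports junctions, not assembled transcripts; going from junctions to full novel isoforms still needs an assembler run on Aligned.sortedByCoord.out.bam.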