Hi, I am trying to analyze an RNA-seq dataset with ERCC spike-ins. I am using TopHat because I am also supposed to look at fusion genes, and so far I haven't found a proper pipeline for that! (Yes, I know STAR has STAR-Fusion, but STAR just crashed on my 32 GB machine. I don't think HISAT has fusion detection functionality!)
I am trying to run TopHat on the human dataset. I merged the ERCC GTF file with the human GTF file, merged the ERCC FASTA file with the human FASTA file, and then indexed the merged FASTA. I am now running TopHat with the trimmed FASTQ files.
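For reference, the merge step described above could be sketched roughly as follows. This is only an illustration, not necessarily how the files were actually merged; `concat_files` and the input file names in the comments are hypothetical. It streams the files rather than loading them into memory, and it adds a newline between parts if one is missing, so that the first ERCC header or GTF record cannot end up glued onto the last human line.

```python
import os
import shutil


def concat_files(parts, merged):
    """Concatenate files in order, ensuring each part ends with a newline."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream the file so a multi-gigabyte genome is never held in memory.
                shutil.copyfileobj(src, out)
                if src.tell() > 0:
                    # Look at the last byte; add a newline if the file lacks one.
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")


# Hypothetical usage (input names are assumptions, not taken from the post):
# concat_files(["human.fa", "ERCC92.fa"], "hg38_ERCC92.fa")
# concat_files(["human.gtf", "ERCC92.gtf"], "Homo_sapiens.GRCh38.91_ERCC92.gtf")
```

The merged FASTA still has to be indexed afterwards, as was done in the post.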
Tophat 2.1.1 Bowtie 2.2.5
tophat --no-novel-juncs --no-coverage-search -r 100 -p 8 -G Homo_sapiens.GRCh38.91_ERCC92.gtf hg38_ERCC92 forward.fastq reverse.fastq
But I am getting an error saying:
Warning: Empty fasta file: './tophat_out/tmp/segment_juncs.fa'
Warning: All fasta inputs were empty
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 ./tophat_out/tmp/segment_juncs.fa ./tophat_out/tmp/segment_juncs
[FAILED]
Error: Splice sequence indexing failed with err =1
Could you please suggest any other tool or pipeline I could use to analyze this dataset?
Thanks, Payal
Is there a reason you want to include the ERCC spike-ins? They're usually pretty useless.
Also, trying to get a machine with more than 32GB RAM is a better option than sticking with tophat.
For now that's the only server I have. We are trying to increase our capabilities, but for now that's all I have got to work with!
I am sorry, but I don't know why they included ERCC spike-ins, because I was not involved in either the study design or the wet lab part of the experiment. All I can think of is that they wanted some kind of internal controls or standards. I was just handed the data, and now I have to figure out how to get meaningful results out of it!
Including the spike-ins in the sequencing isn't uncommon. They tend to just get ignored at the analysis stage, since they tend to do more harm than good. I suggest using the unmodified GTF file (without the spike-ins) and seeing if the tophat issue goes away.
BTW, you might even be able to use usegalaxy.org to get access to enough memory.
Thanks, let me try those two options!
I used ERCC spike-ins with tophat using indexes from iGenomes. Give it a try.
Using iGenomes to run a seq analysis with Tophat/Cufflinks... but how do I add the ERCC sequences to the reference transcriptome and reference genome?
Yup, I did look into this post while looking for answers. Another problem I found was that if the genome and GTF files don't use the same annotation, it can throw errors, so I downloaded both the GTF and the genome FASTA file from the Ensembl database!
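One quick way to check for that kind of mismatch is to compare the sequence names in the GTF against the headers in the FASTA. Below is a minimal sketch, assuming plain uncompressed files; the function names are made up for illustration. A common case it would catch is UCSC-style names ("chr1") in the genome versus Ensembl-style names ("1") in the annotation. It only checks naming, so it won't tell you whether that is what is behind a particular tophat error.

```python
def fasta_names(path):
    """Collect sequence names: the first word after '>' on each FASTA header."""
    names = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_seqnames(path):
    """Collect the first column (seqname) of every non-comment GTF line."""
    names = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_path, gtf_path):
    """Return GTF seqnames that have no matching FASTA header, sorted."""
    return sorted(gtf_seqnames(gtf_path) - fasta_names(fasta_path))
```

If `missing_from_fasta(...)` returns an empty list for the merged FASTA and merged GTF, every annotated sequence (ERCC entries included) has a matching reference sequence. Anything it does return is a name to fix before aligning.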