I'm running cufflinks on some data produced from a viral system using Ion Proton sequencing. I've been running several different RNA-Seq analysis workflows on this data. In particular this set is aligned with STAR + Bowtie2. I've also done HiSat2 + StringTie + Ballgown analysis on this data since that is sort of the replacement workflow for the traditional Tuxedo suite.
However I am running into an issue with the Cufflinks data. When running using:
cufflinks -g reference.gff -b reference.fa -u -p 24 --library-type fr-secondstrand sample.merged.sorted.bam
I end up not getting any results. Both skipped.gtf and transcript.gtf are empty, the tracking files just have the header.
Output to the screen:
You are using Cufflinks v2.2.1, which is the most recent release.
[04:39:37] Loading reference annotation. [04:39:37] Inspecting reads and determining fragment length distribution. Processed 1 loci. [*************************] 100%
Map Properties:
Normalized Map Mass: 16399479.00
Raw Map Mass: 16399479.00
Fragment Length Distribution: Truncated Gaussian (default)
Default Mean: 200
Default Std Dev: 80 [04:41:49]
Assembling transcripts and estimating abundances.
Processed 1 loci. [*************************] 100%
[04:44:31] Loading reference annotation and sequence.
[04:44:31] Learning bias parameters.
Processed 0 loci. [*************************] 100%
[04:45:50] Re-estimating abundances with bias correction. Processed 0 loci.
I tried setting -max-bundle-frags to a higher number; however then cufflinks gets stalled at 99% complete, waiting for 1 thread to finish for about a day or so, which normally these runs finish in a few minutes. I'm using a custom GFF file but I've also tried using another one that exists for this reference sequence, tried both GFF and GTF files. The gffread utility doesn't show any problems with these files. I've tried not using the -u flag.
Any suggestions what might be going on? The same dataset of Ion Proton reads has also been aligned against the human genome (since the viral system is infecting human cell culture) and that seems to work fine.
Sorry did not read the full post. Please ignore this. Cufflinks takes a gtf. You seem to have given a gff. Although don't know for sure if that is causing it.
Cufflinks will take both a GTF or a GFF. They mix and match which they specify in various in instructions but also, specifically for -g specify it as gff/gtf. Also, as described I tried both the GFF and GTF versions of my annotation files with the same results.
My apologies. I replied too fast without reading the entire post. I could not find any option to delete the post soon after posting. On another note, cufflinks has a --verbose option, just in case it hints to what is going on.
Trying that now. Not sure why I didn't even thing of that for more debugging info
I get some GFF Warnings (which oddly I don't get when I use gffread on the file). It seems to be something that happens around the second time it loads the reference annotation:
Can you please also post a few lines of the vcf. Just in case is the strand in gtf a '+' and '-' and not set to a '.'. ?
Do you mean the GTF (there is no VCF). There is strand information in the GTF files (+ and -) for all of the features.
Ah my bad. Turns out it was a wrong lead. I was trying to grep the error message 'Kept \d+ transfrags out of 0' in the cufflinks source. Apparently the fault lies elsewhere.
I've not had any luck trying to calculate DE on viral expression with cufflinks. I've tried everything from the NCBI provided GFF/GTFs to making my own from scratch (not hard for ssRNA viruses) without success. I think that the way viruses are represented in GFF/GTFs is somehow valid to gffread but causes problems to parts of cuff* tool set. Then again I've also had instances with mammalian GTF/GFFs going through gffread without issue but parts of cuff* throwing a fit, so I don't think gffread is a completely reliable means for checking.
PS: What is a "viral system"?
I was just being non-specific about the virus, etc. But it is human cells infected with virus. So the mRNAs captured are a mix of virus transcripts and human transcripts. Viral system seemed an apt description for the experiment and set up.
And thanks for the input. It definitely seems like it is some issue with the annotations and how they are interpreted.
Ah, I thought you had something funky going on, like a system of multiple viral species. The cufflinks verbose output gave away the virus ;).
Have you tried deleting or manually fudging the regions cufflinks is complaining about? I've tried that without success, maybe it will work for you.
Here's a paper I had looked at previously, but haven't had a chance to try: http://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1003896#s4
Yes, I'm now looking at featureCounts + DESeq2, so similar to HTSeq + edgeR