I have a couple of questions regarding Cufflinks. I ran Cufflinks transcript assembly for 12 human cell lines, merged the transcript.gtf files from all samples with the reference transcript file using cuffmerge, and then ran cuffdiff. This analysis reported 163 differentially expressed genes out of a total of around 60k genes. The cuffdiff output files do not give per-sample read counts or FPKM values, so I was not sure how to validate the significant genes by comparing the read counts in each sample. Also, the publication by the Trapnell group mentions that for a well-annotated organism such as human or mouse, transcript assembly is not essential if you are only looking at differential expression, so I followed the alternate protocol described in that paper, where the alignment is done with novel splice junction detection turned off in TopHat and cuffdiff is then run without the transcript assembly step. This analysis reported 141 differentially expressed genes, but again I don't know how to get the per-sample read counts and FPKM data from the Cufflinks pipeline. The cuffdiff output from my first analysis (with transcript assembly) does not report any significantly differentially spliced genes. Is this normal? (I did reference-guided assembly, and TopHat was run with novel splice junction detection on.)
The Cufflinks pipeline, although described as robust in many papers, does not seem to be transparent about what it's doing. I am wondering whether I should choose a different method of differential expression analysis. Any suggestions would be helpful.
The Tuxedo pipeline is indeed not the best route to follow. I would suggest you get raw read counts from your BAM files using featureCounts from the Subread package, and do a sound statistical analysis with edgeR or limma.
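To give a rough idea of what working from raw counts looks like, here is a minimal Python sketch that reads a featureCounts-style table into a per-sample count matrix and derives FPKM from it. It assumes the usual tab-separated layout (a leading "#" comment line, then Geneid, Chr, Start, End, Strand, Length, followed by one column per BAM file) and integer counts. The gene names, BAM names, and numbers in the example are made up for illustration, and using the sum of assigned counts as the library size is an assumption of this sketch, not what Cufflinks does internally.

```python
import csv
import io

# Columns that featureCounts writes before the per-sample count columns.
ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def read_featurecounts(handle):
    """Return (sample_names, lengths, counts) from a featureCounts-style table.

    lengths maps gene id -> gene length in bp;
    counts maps gene id -> list of raw counts, one per sample.
    """
    # Skip the leading "# Program:featureCounts ..." comment line.
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    samples = header[len(ANNOTATION_COLUMNS):]
    lengths, counts = {}, {}
    for row in rows:
        gene = row[0]
        lengths[gene] = int(row[5])
        # Assumes integer counts (no fractional counting options were used).
        counts[gene] = [int(x) for x in row[len(ANNOTATION_COLUMNS):]]
    return samples, lengths, counts


def fpkm(counts, lengths):
    """FPKM = 1e9 * count / (library total * gene length in bp).

    Assumption: the library total is the sum of assigned counts per sample.
    """
    n_samples = len(next(iter(counts.values())))
    totals = [sum(c[i] for c in counts.values()) for i in range(n_samples)]
    return {
        gene: [
            1e9 * c[i] / (totals[i] * lengths[gene]) if totals[i] else 0.0
            for i in range(n_samples)
        ]
        for gene, c in counts.items()
    }


# Made-up example table (hypothetical genes and BAM names), tab-separated.
EXAMPLE = (
    "# Program:featureCounts; Command: ...\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tcellA.bam\tcellB.bam\n"
    "geneX\tchr1\t100\t1099\t+\t1000\t300\t100\n"
    "geneY\tchr1\t5000\t6999\t-\t2000\t700\t900\n"
)

samples, lengths, counts = read_featurecounts(io.StringIO(EXAMPLE))
fpkm_table = fpkm(counts, lengths)
for gene in counts:
    raw = dict(zip(samples, counts[gene]))
    print(gene, raw, [round(v, 1) for v in fpkm_table[gene]])
```

The raw count matrix is what you would hand to edgeR or limma; the FPKM values are only there so you can eyeball individual genes per sample, not as input for the statistics.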
That's opinion, not an answer. If you're going to suggest an alternate method at least substantiate your claim.
Yes, it was a comment not an answer, very sharp.
If you read the questions and answers on this website that involve RPKM and sound statistics, you'll see that most of them recommend NOT using RPKM but raw read counts instead, in combination with something like edgeR or DESeq.
But don't believe the opinion of people on this website, why should you? (I am being sarcastic in case it is not clear). There is also literature available, e.g., http://www.genomebiology.com/2013/14/9/R95
The OP is clearly not experienced, or they would not be asking the question or following an out-of-date paper that most people attempt to use whilst developing their RNA-Seq analysis skills. Asking you to substantiate your comments was an attempt to help the OP, not to have a go at you.
I am still waiting for some answers :( . Thanks b.nota for the comments.