I have around 120 GB of RNA-seq data. A draft genome is available for my organism of interest, but there are no standard annotations. I want to do differential expression analysis.
The approach I have used is:
- Run tophat2 on all samples.
- Run cufflinks to get a transcripts.gtf file for each sample, which is supposed to contain the assembled isoform structures.
- Run cuffmerge on all GTF files to create a merged.gtf file, which is a kind of master GTF file of all possible isoforms.
- Run cuffdiff/edgeR/DESeq for differential expression analysis using the merged.gtf.
- Convert the merged.gtf to a FASTA file of transcripts using gffread (of the Tuxedo suite), annotate all the transcripts using Annocript, and append this annotation information to the cuffdiff output to know the function of the differentially expressed transcripts.
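For reference, the steps above can be sketched as a small Python helper that only builds the command lines and does not run anything. The sample names, file paths, index prefix, thread count, and the `build_commands` function itself are hypothetical placeholders; the flags shown (`-p`, `-o`, `-s`, `-L`, `-w`, `-g`) are the common ones for these tools, but check them against the versions you have installed. Annocript is left out because its invocation depends on its own configuration file.

```python
import shlex
from pathlib import Path


def build_commands(samples, bowtie_index, genome_fa, threads=8, out="work"):
    """Build the pipeline as argv lists; nothing is executed here.

    samples: dict mapping condition -> list of (sample_name, reads_1, reads_2).
    Returns (commands, gtfs), where gtfs are the per-sample transcripts.gtf
    paths that should be written, one per line, into assemblies.txt.
    """
    out = Path(out)
    cmds, gtfs, bams = [], [], {}
    for condition, reps in samples.items():
        for name, r1, r2 in reps:
            th_dir = out / "tophat" / name
            cl_dir = out / "cufflinks" / name
            bam = str(th_dir / "accepted_hits.bam")
            # Step 1: spliced alignment against the draft genome.
            cmds.append(["tophat2", "-p", str(threads), "-o", str(th_dir),
                         bowtie_index, r1, r2])
            # Step 2: per-sample assembly, producing transcripts.gtf.
            cmds.append(["cufflinks", "-p", str(threads), "-o", str(cl_dir), bam])
            gtfs.append(str(cl_dir / "transcripts.gtf"))
            bams.setdefault(condition, []).append(bam)

    # Step 3: merge all per-sample assemblies into merged.gtf.
    merged_dir = out / "merged"
    cmds.append(["cuffmerge", "-p", str(threads), "-s", genome_fa,
                 "-o", str(merged_dir), str(out / "assemblies.txt")])
    merged = str(merged_dir / "merged.gtf")

    # Step 4: differential expression; replicates of one condition are
    # comma-separated, conditions are separate arguments.
    cmds.append(["cuffdiff", "-p", str(threads), "-o", str(out / "cuffdiff"),
                 "-L", ",".join(bams), merged]
                + [",".join(paths) for paths in bams.values()])

    # Step 5: extract transcript sequences for functional annotation.
    cmds.append(["gffread", "-w", str(out / "transcripts.fa"),
                 "-g", genome_fa, merged])
    return cmds, gtfs


if __name__ == "__main__":
    demo = {
        "ctrl": [("c1", "c1_1.fq", "c1_2.fq")],
        "treated": [("t1", "t1_1.fq", "t1_2.fq")],
    }
    commands, _ = build_commands(demo, "idx/genome", "genome.fa")
    for argv in commands:
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before handing each one to `subprocess.run` or a job scheduler.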
I would like to know whether this approach is OK and would like to hear suggestions to improve the pipeline. I have doubts about the merged.gtf file generated by cuffmerge: can I blindly depend on this file, treating it as a standard GTF file like those from Ensembl? I do not see any alternative approach (tools like StringTie do a similar job). I want to use merged.gtf as a standard annotation file, like one from Ensembl, to run other tools/pipelines of interest that deal with differential splicing between conditions. Can I rely on merged.gtf?
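One way to get a first impression of such a file before relying on it is to summarize its structure. The sketch below is only an illustration: it assumes the usual nine-column, tab-separated GTF layout with `gene_id` and `transcript_id` attributes on `exon` lines, and the `summarize_gtf` name and the chosen statistics are hypothetical. A large share of single-exon transcripts or very many isoforms per gene may be worth a closer look, but these numbers alone do not prove the assembly is good or bad.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR = re.compile(r'(\S+) "([^"]*)";')


def summarize_gtf(lines):
    """Count genes, transcripts, and exons per transcript from GTF lines."""
    exons = defaultdict(int)   # transcript_id -> number of exon records
    genes = defaultdict(set)   # gene_id -> set of transcript_ids
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        tid, gid = attrs.get("transcript_id"), attrs.get("gene_id")
        if tid is None or gid is None:
            continue
        exons[tid] += 1
        genes[gid].add(tid)
    return {
        "genes": len(genes),
        "transcripts": len(exons),
        "single_exon_transcripts": sum(1 for n in exons.values() if n == 1),
        "max_isoforms_per_gene": max((len(t) for t in genes.values()), default=0),
    }


if __name__ == "__main__":
    import sys
    # Usage: python summarize_gtf.py merged.gtf
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            print(summarize_gtf(handle))
```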
On a side note, I wouldn't go for cuffdiff for DE analysis, considering the lack of power it has to detect most genes. Count-based methods, like DESeq2 or edgeR, tend to work well.
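Count-based tools take a matrix of raw integer counts, one row per feature and one column per sample. As a minimal sketch, assuming each sample already has a two-column, tab-separated count file in the style written by htseq-count (feature ID, count, plus summary lines starting with `__`), the per-sample files could be combined like this. The `merge_counts` function and the sample names are hypothetical.

```python
def merge_counts(per_sample):
    """Combine per-sample count lines into one matrix.

    per_sample: dict mapping sample name -> iterable of 'feature<TAB>count'
    lines. Features missing from a sample get a count of 0.
    Returns (header, rows) with rows sorted by feature ID.
    """
    samples = list(per_sample)
    table = {}
    for j, sample in enumerate(samples):
        for line in per_sample[sample]:
            line = line.strip()
            # Skip blanks and htseq-count summary lines such as __no_feature.
            if not line or line.startswith("__"):
                continue
            feature, count = line.split("\t")
            table.setdefault(feature, [0] * len(samples))[j] = int(count)
    rows = [[feature] + table[feature] for feature in sorted(table)]
    return ["feature"] + samples, rows


if __name__ == "__main__":
    demo = {
        "ctrl_1": ["XLOC_1\t5", "XLOC_2\t0", "__no_feature\t10"],
        "treated_1": ["XLOC_1\t7", "XLOC_3\t2"],
    }
    header, rows = merge_counts(demo)
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(value) for value in row))
```

The resulting table can be written to a TSV file and read into R as the count matrix for DESeq2 or edgeR.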
You might want to try Ballgown, which is more sensitive, allows for modeling covariates, and works "natively" on transcriptome assemblies from Cufflinks.
Noted and modified.