Dear community,
A similar question was asked previously on SeqAnswers, but no answers were posted, so I'll expand on it here.
I'm trying to better understand the cufflinks --> cuffdiff workflow. Once I run cufflinks on each of my .bam files (from tophat), I have a separate .gtf assembly for each sample. To run cuffdiff I need a single unified .gtf file of my assembled transcripts.
If I want to run a differential expression analysis with cuffdiff, should I use the merged.gtf file produced by cuffmerge or the combined.gtf file produced by cuffcompare? How are these two files different, and what would be the downstream effect of using one or the other for differential expression in cuffdiff?
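To make the two options concrete, here is a minimal sketch of the command lines I am comparing. It only builds the argument lists and does not run anything. The helper function names, file names, and condition labels are all made up for illustration. The flags are the ones I understand from the Cufflinks manual (`-o`, `-g`, `-s` for cuffmerge; `-o`, `-r` for cuffcompare; `-o`, `-L` for cuffdiff), so please correct me if I have any of them wrong.

```python
# Sketch only: builds command lines as argument lists, runs nothing.
# Every file name and label below is a hypothetical placeholder.

def cuffmerge_cmd(manifest, ref_gtf=None, genome_fa=None, out_dir="merged_asm"):
    """cuffmerge takes a text file listing one per-sample assembly GTF per line."""
    cmd = ["cuffmerge", "-o", out_dir]
    if ref_gtf:
        cmd += ["-g", ref_gtf]    # optional reference annotation
    if genome_fa:
        cmd += ["-s", genome_fa]  # optional genome sequence
    return cmd + [manifest]       # result expected at <out_dir>/merged.gtf


def cuffcompare_cmd(sample_gtfs, ref_gtf=None, out_prefix="cuffcmp"):
    """cuffcompare takes the per-sample GTFs directly on the command line."""
    cmd = ["cuffcompare", "-o", out_prefix]
    if ref_gtf:
        cmd += ["-r", ref_gtf]    # optional reference annotation
    return cmd + list(sample_gtfs)  # result expected at <out_prefix>.combined.gtf


def cuffdiff_cmd(gtf, conditions, out_dir="diff_out"):
    """conditions maps a condition label to its list of replicate BAM files."""
    labels = ",".join(conditions)
    groups = [",".join(bams) for bams in conditions.values()]
    return ["cuffdiff", "-o", out_dir, "-L", labels, gtf] + groups


if __name__ == "__main__":
    conditions = {
        "ctrl": ["ctrl_1.bam", "ctrl_2.bam"],
        "treat": ["treat_1.bam", "treat_2.bam"],
    }
    # Route A: cuffmerge output as the cuffdiff annotation.
    print(" ".join(cuffmerge_cmd("assemblies.txt", ref_gtf="genes.gtf")))
    print(" ".join(cuffdiff_cmd("merged_asm/merged.gtf", conditions)))
    # Route B: cuffcompare output as the cuffdiff annotation.
    print(" ".join(cuffcompare_cmd(["ctrl_1.gtf", "treat_1.gtf"], ref_gtf="genes.gtf")))
    print(" ".join(cuffdiff_cmd("cuffcmp.combined.gtf", conditions)))
```

In both routes the cuffdiff call has the same shape; the only thing that changes is which GTF is passed in, which is exactly the choice I am asking about.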
EDIT: Or would a better workflow be to forgo cuffmerge/cuffcompare altogether and instead run cufflinks on a merge of all the .bam files (e.g. made with samtools merge), producing a single assembly that maximizes assembly accuracy, and then use that assembly as the "reference" for cuffdiff?
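A rough sketch of what I mean by the pooled-BAM alternative, again only building the command lines rather than running them. The helper names, file names, and output directory are hypothetical. I am assuming the tophat BAMs are all coordinate-sorted, since samtools merge expects its inputs to be sorted the same way.

```python
# Sketch only: builds command lines as argument lists, runs nothing.
# Every file name below is a hypothetical placeholder.

def samtools_merge_cmd(out_bam, in_bams):
    """Pool several sorted BAM files into one (output file comes first)."""
    return ["samtools", "merge", out_bam] + list(in_bams)


def cufflinks_cmd(bam, out_dir="pooled_asm", guide_gtf=None):
    """Assemble transcripts from a single BAM; writes <out_dir>/transcripts.gtf."""
    cmd = ["cufflinks", "-o", out_dir]
    if guide_gtf:
        cmd += ["-g", guide_gtf]  # optional reference annotation used as a guide
    return cmd + [bam]


if __name__ == "__main__":
    sample_bams = ["ctrl_1.bam", "ctrl_2.bam", "treat_1.bam", "treat_2.bam"]
    print(" ".join(samtools_merge_cmd("pooled.bam", sample_bams)))
    print(" ".join(cufflinks_cmd("pooled.bam")))
    # pooled_asm/transcripts.gtf would then be the annotation given to cuffdiff,
    # with the original per-sample BAMs as the replicate inputs.
```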
...
More info:
From the cuffcompare documentation:
Cuffcompare clusters/tracks transfrags across samples, and writes a GTF file <outprefix>.combined.gtf containing a nonredundant set of transcripts across all input files (with a single representative transfrag chosen for each clique of matching transfrags across samples).
From the cuffmerge documentation:
cuffmerge takes two or more Cufflinks GTF files and merges them into a single unified transcript catalog. Optionally, you can provide the script with a reference GTF, and the script will use it to attach gene names and other metadata to the merged catalog.
What if one is not looking for novel isoforms? Is it essential to merge all the GTFs with the reference GTF when we are only interested in the expression levels of known isoforms?