Using the cuffnorm program from the Cufflinks suite with a GTF file (say, x.gtf) and three BAM files (1.bam, 2.bam and 3.bam), we can call
cuffnorm x.gtf 1.bam 2.bam
and
cuffnorm x.gtf 1.bam 3.bam
Looking at the gene FPKM values, the two calls produce output like
tracking_id q1_0 q2_0
ENSG00000000003.10 2.59667 32.8815
ENSG00000000005.5 0 0
ENSG00000000419.8 68.1701 76.2395
...
and
tracking_id q1_0 q2_0
ENSG00000000003.10 2.76372 14.1348
ENSG00000000005.5 0 0
ENSG00000000419.8 72.5559 38.017
...
Of course, the last gene expression column is related to two different BAM files, which explains its variation. However, the first gene expression column always corresponds to 1.bam, so in principle, we would expect it to be identical for both outputs. We do see, however, some variation.
Our questions are now: Why is that so? Is there some way to bypass that?
Many thanks for your help in advance!
Hi Devon, many thanks for your comment.
Is there a way to obtain "absolute" FPKM values for each RNA-seq BAM file? If necessary, we could use a program other than cufflinks, though from all I have heard cufflinks is pretty good for such purposes.
You could just change the library normalization method to "classic-fpkm" (--library-norm-method classic-fpkm). Keep in mind that you then can't directly use the values for statistics.
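To illustrate why the 1.bam column can shift between runs, here is a minimal sketch with made-up numbers. It assumes that the default "geometric" normalization scales each library by a median-of-ratios size factor computed across all libraries in the run, whereas classic FPKM uses only the library's own total. The counts, gene lengths and function names are hypothetical, and this is not cuffnorm's actual code.

```python
from math import exp, log
from statistics import median

# Hypothetical toy data: fragment counts for three genes in three libraries.
counts = {
    "1.bam": [100, 200, 300],
    "2.bam": [400, 210, 310],
    "3.bam": [110, 190, 900],
}
gene_lengths = [1000, 2000, 1500]  # hypothetical gene lengths in bp


def classic_fpkm(sample):
    """FPKM scaled only by the sample's own total fragment count."""
    total = sum(counts[sample])
    return [c * 1e9 / (length * total)
            for c, length in zip(counts[sample], gene_lengths)]


def median_ratio_size_factor(sample, group):
    """Median over genes of (count / geometric mean of counts in the group).

    Assumes all counts are non-zero; genes with zeros would need to be
    skipped before taking logarithms.
    """
    ratios = []
    for g in range(len(gene_lengths)):
        geo_mean = exp(sum(log(counts[s][g]) for s in group) / len(group))
        ratios.append(counts[sample][g] / geo_mean)
    return median(ratios)


# Size factor of 1.bam depends on which other library it is run with.
sf_with_2 = median_ratio_size_factor("1.bam", ["1.bam", "2.bam"])
sf_with_3 = median_ratio_size_factor("1.bam", ["1.bam", "3.bam"])
print(sf_with_2, sf_with_3)

# Classic FPKM of 1.bam does not depend on any other library.
print(classic_fpkm("1.bam"))
```

In this toy setup the scaling factor for 1.bam differs depending on its partner library, so any value derived from it differs too, while the classic FPKM values for 1.bam stay the same regardless of what else is in the run.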
Hi Devon, what is meant here by statistics?
Most people doing RNAseq want to look at differential expression and things like that. You can't reliably do those things (i.e., perform any comparative statistics) on raw FPKMs.
That's very interesting. I have to analyze roughly 1000 RNA-seq datasets, and each BAM file is huge (median 20 GB). How would you go about analyzing all these files?
It depends on the organism.
It is human, hg19 annotation. Or would you suggest entirely different software for the gene expression analysis? In the end, after processing these data, I want to compare them to already processed FPKM data from the TCGA consortium.
If you intend to do the typical differential expression analysis, then just run featureCounts on the BAM files. You'll get the raw counts needed for downstream statistics from that.
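For many BAM files, the featureCounts call could be assembled from a script. The sketch below is one possible way to do it; the helper names, the output file name and the thread count are hypothetical, and the -t exon / -g gene_id settings assume a standard Ensembl/GENCODE-style GTF. Paired-end libraries need additional options, so check the featureCounts documentation for your version.

```python
import subprocess


def build_featurecounts_cmd(gtf, bams, out="counts.txt", threads=4):
    """Return the featureCounts command line as a list of arguments.

    -T: number of threads, -t: feature type to count, -g: attribute used
    to group features into genes, -a: annotation file, -o: output table.
    """
    return [
        "featureCounts",
        "-T", str(threads),
        "-t", "exon",
        "-g", "gene_id",
        "-a", gtf,
        "-o", out,
        *bams,
    ]


def run_featurecounts(gtf, bams, out="counts.txt", threads=4):
    """Run featureCounts; requires the program to be installed and on PATH."""
    subprocess.run(build_featurecounts_cmd(gtf, bams, out, threads), check=True)


cmd = build_featurecounts_cmd("x.gtf", ["1.bam", "2.bam", "3.bam"])
print(" ".join(cmd))
```

The resulting table has one row per gene and one count column per BAM file, which is the input that count-based differential expression tools expect.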