Question

Question about the appropriate FPKM gene expression processing

0

Entering edit mode

10.0 years ago

AlexAbdulkaderKheirallah ▴ 120

Hi there,

As part of my RNA-Seq project analysis I wrote a program to organize a gene expression table for subsequent analyses. The reason I wrote the program is because cufflinks/cuffnorm output (1) amalgamated some separate genes being in close proximity to each other into one XLOC id (2) some genes occurred more than once in the list and at each occurrence had a different XLOC despite having the same gene name. I figured that the possible reason for that is because the same gene may have different TSS and hence will have different XLOC id. Therefore the final situation was that the same gene would have different FPKM expression values in each XLOC.

The program I wrote (1) separates genes sharing the same XLOC and assigns to them the original FPKM that was reported by cuffnorm per XLOC (2) identifies genes that occur multiple times and averages their expressions as reported in different XLOC ids. And here is my crucial question:

I noticed that these FPKM expressions can vary significantly between XLOCs. So for example imagine this situation:

gene_id        gene_name        sample 1         sample 2
XLOC_1         funnyGen           20                     1
XLOC_3         funnyGen            2                      1

Now, if we were to average the data as I wrote the program we end up with:

gene_name      sample 1              sample 2
funnyGen             11                         1

It seems to me the data could be significantly skewed. Is your advise then to average the FPKM data or perhaps only add them to each other. The letter scenario may be better in the type of scenario described above but averaging may be better for other. And hence I'm undecided for which option to go, an option that I can apply to the whole dataset.

Thanks!

RNA-Seq • 2.9k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by AlexAbdulkaderKheirallah ▴ 120

0

Entering edit mode

I could solve it by providing GTF file at the time of TopHat alignment. So Tophat alignment without GTF file produced two XLOC ids for the gene of my interest. On the other hand, the one with GTF file produced only one XLOC id.

GTF file was provided at the Cufflinks stage as well but providing it at TopHat stage made the difference.

If you are not willing to run the TopHat command again, you can look for the class_code of the XLOC ids and accept the XLOC ID with class_code such as "=" (and not "x"). That is select one instead of averaging out. Don't know how to filter XLOC ID by class_code.

Hope it helps.

ADD REPLY • link 8.5 years ago by Satyajeet Khare ★ 1.6k

Ram · Answer 1 · 2014-12-10

1

Entering edit mode

10.0 years ago

Istvan Albert 101k

I think the right course of action is to count the reads with a different program like htseq-count where you have full control on what type of overlap to allow and process that with a different differential expression method. If needed computing FPKM from counts is very straightforward.

Reverse engineering the FPKMs from cuffdiff output feels a risky approach as it is not clear how these values were derived in your case.

ADD COMMENT • link 10.0 years ago by Istvan Albert 101k

0

Entering edit mode

I am not entirely sure about that but, since htseq-count allows to map the number of reads to each of the features described in the provided GTF we are faced with the same problem? I mean if I will use my predicted GTF rather than the reference GTF then again I will have the reported reads aligned to different XLOC_id attributes that may have assigned the same gene_name? Also, would you know whether the number of reads reported will be normalized for library size and gene length?

I'm going to examine the genes that had multiple XLOC attributed to them (as I said possibly due to different TSS) and if it turns out that there are only two XLOCs for all the genes considered and it looks like one is showing the predominant FPKM expression (other being negligible) then I think I can ignore the other. I just wasn't sure about the degree of FPKM variability and hence averaged the numbers but perhaps I should look more carefully at the data.

But I'm open to your suggestions.

Thank again

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 10.0 years ago by AlexAbdulkaderKheirallah ▴ 120