Hi there,
As part of my RNA-Seq analysis I wrote a program to organize a gene expression table for downstream analyses. I wrote it because the cufflinks/cuffnorm output (1) merged some separate genes lying close to each other into a single XLOC id, and (2) listed some genes more than once, each time with a different XLOC id despite the same gene name. My guess is that the same gene can have different TSSs and therefore gets different XLOC ids. The end result is that the same gene has a different FPKM value in each XLOC.
The program (1) separates genes that share an XLOC id and assigns each of them the FPKM that cuffnorm reported for that XLOC, and (2) identifies genes that occur multiple times and averages their expression across the different XLOC ids. And here is my crucial question:
I noticed that these FPKM values can vary significantly between XLOCs. For example, imagine this situation:
gene_id    gene_name    sample 1    sample 2
XLOC_1     funnyGen     20          1
XLOC_3     funnyGen     2           1
Now, if we average the data, as my program does, we end up with:
gene_name    sample 1    sample 2
funnyGen     11          1
It seems to me the data could be significantly skewed. Would you advise averaging the FPKM values, or summing them instead? The latter may be better in the scenario described above, but averaging may be better in others, so I'm undecided. I need a single option that I can apply to the whole dataset.
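To make the two options concrete, here is a minimal sketch in Python that collapses the funnyGen rows from the example both ways. The row layout and the `combine` helper are hypothetical, made up for illustration only; this is not my actual program and not part of cuffnorm.

```python
from collections import defaultdict

# Hypothetical rows mirroring the example above:
# (gene_id, gene_name, [FPKM in sample 1, FPKM in sample 2])
rows = [
    ("XLOC_1", "funnyGen", [20.0, 1.0]),
    ("XLOC_3", "funnyGen", [2.0, 1.0]),
]

def combine(rows, how="mean"):
    """Collapse rows that share a gene_name by averaging or summing per sample."""
    grouped = defaultdict(list)
    for _gene_id, gene_name, fpkms in rows:
        grouped[gene_name].append(fpkms)

    combined = {}
    for gene_name, values in grouped.items():
        # Column-wise totals across all XLOC ids reported for this gene name.
        totals = [sum(column) for column in zip(*values)]
        if how == "sum":
            combined[gene_name] = totals
        elif how == "mean":
            combined[gene_name] = [total / len(values) for total in totals]
        else:
            raise ValueError("how must be 'mean' or 'sum'")
    return combined

print(combine(rows, "mean"))  # {'funnyGen': [11.0, 1.0]}
print(combine(rows, "sum"))   # {'funnyGen': [22.0, 2.0]}
```

Averaging gives 11 and 1, while summing gives 22 and 2, so whichever rule is chosen changes the sample 1 to sample 2 ratio only through the scale, not the direction, in this particular example.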
Thanks!
I could solve it by providing a GTF file at the TopHat alignment step. TopHat alignment without a GTF file produced two XLOC ids for my gene of interest, whereas alignment with a GTF file produced only one XLOC id.
The GTF file was provided at the Cufflinks stage as well, but providing it at the TopHat stage is what made the difference.
If you don't want to rerun TopHat, you can look at the class_code of each XLOC id and keep the one whose class_code is "=" (and not "x"). That is, select one instead of averaging. I don't know how to filter XLOC ids by class_code, though.
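For what it's worth, the class_code filtering could probably be scripted. Below is a rough Python sketch, assuming the XLOC ids and class codes sit in the attribute column of a GTF file as `gene_id "..."; class_code "...";`. The two sample lines, the attribute layout, and the helper name `xloc_ids_with_class_code` are all assumptions for illustration, so check them against your own file first.

```python
import re

# Hypothetical GTF lines; the attribute layout (gene_id, gene_name, class_code)
# is an assumption about what your merged/compared GTF looks like.
gtf_lines = [
    'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\t'
    'gene_id "XLOC_1"; transcript_id "TCONS_1"; gene_name "funnyGen"; class_code "=";',
    'chr1\tCufflinks\texon\t900\t950\t.\t-\t.\t'
    'gene_id "XLOC_3"; transcript_id "TCONS_3"; gene_name "funnyGen"; class_code "x";',
]

# Matches attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')

def xloc_ids_with_class_code(lines, wanted="="):
    """Return the set of gene_id values that have a record with the wanted class_code."""
    keep = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a full GTF record
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if attrs.get("class_code") == wanted and "gene_id" in attrs:
            keep.add(attrs["gene_id"])
    return keep

print(sorted(xloc_ids_with_class_code(gtf_lines)))  # ['XLOC_1']
```

The resulting set can then be used to keep only the matching rows of the expression table, instead of averaging across XLOC ids.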
Hope it helps.