Hi,
I am willing to filter the highly transcribed genes of my experiment based on their fpkm
values. First I need a confirmation whether that is a correct strategy?
If so I am filtering from genes.fpkm_tracking
of the cuffdiff
outputs not the rest (cds, isoforms or tss). I hope that is correct also.
Then I am wondering why several different genes show up with exact fpkm
value in the genes.fpkm_tracking
file? Can some body explain that please? And let's assume initially I select the 500 highest fpkm
s, then I split those genes and I end up with 1000 genes. Now my aim was to filter the top 500 highly transcribed genes, that 1000 gene result will ruin it so I have to select for less genes for a second time! Is that how it works?
Thanks in advance!
Thanks for explanation on why genes can have same FPKM. Previously I filtered on counts produced from DESeq2 but then I thought maybe its more accurate to do it on FPKM and compare the outcome. I have no idea about problems surrounding FPKMs, so if you could clarify or share a link I appreciate that. So your recommendation is to stick to the counts? However I am not familiar with raw/expected counts term. A newbie here!
By splitting I meant splitting the string characters. Now you don't need to consider it, you gave the answer already.
Thanks a lot!
One of the biggest problems with FPKMs is that they're easily skewed by differences in abundance of a single highly expressed gene. Since samples will have different amounts of rRNA depletion and rRNAs are the majority of transcripts this can create real problems. If you calculate FPKM from normalized counts, then this isn't a problem.
The other issue with FPKMs is that by normalizing by gene length, you're throwing away genes for which you have more statistical power.
Expected counts are produced by programs like RSEM and eXpress (cufflinks uses them, but I don't know if it actually outputs them). They're essentially just standard/raw counts that include ambiguous alignments via an expectation maximization algorithm. Read the papers on any of the aforementioned tools for details.
Yes, sticking with counts is typically the most useful route. I did have one experiment where making FPKMs from the normalized counts and filtering according to that had the most power, but that's more the exception than the rule.
Thanks for the very thorough explanation Devon. However that made me to come up with more questions!
I would like to ask how can I calculate FPKM from normalized counts? The only way I know to make normalized counts is using DESeq package but then I could have your advice how to calculate back the FPKM from it.
Next thing regarding eXpress! Now I am trying it but since I work on S. pombe and all the analysis I did so far was on a workstation, I am doing eXpress the same but it seems it is taking a very loooooong time (over-night and running) creating
hits.bam
which I assume is equal toacceptedhits.bam
from Tuxedo. The file is very big and is expanding. Its 40GB right now in eXpress comparing to about 0.5GB in Tuxedo output. Is it the way it works?Thank you so much for all the helps.
From normalized counts, just divide by the length of each gene and then by 1 million.
Regarding express, that seems a bit over the top in terms of file size. You might look through that file and see if it's just writing gibberish or repeating alignments or something like that.
Thanks for the guidance. I tried to create FPKM from counts but I got some questions that if you would read what I've done below I would like to ask.
Then here the values I get for
counts.sum
vary between hundreds to several thousands. Therefore, when I divide them to gene-length and then a million the resulted FPKM will become so small. Is it correct? However this doesn't matter if I just want to sort them and pick the highly transcribed genes but in case of expectations for those values I am asking.I have another question regarding creating gene-length. I am wondering if this will do for creating them since I am not very familiar with txdb s.
Please let me know what you think and thanks in advance.
Remember that gene length is in kilobases, otherwise you'll get 1000x smaller numbers than you should. Anyway, in general, RPKM values will be much smaller than the normalized (or raw) counts. The exception is short genes (namely, anything <1kb).
Here is an R script that computes union-gene model gene lengths (among other things). I would personally follow that route. You'd just use the txdb for your fungus to get the exons split by gene, reduce that and then sum the widths.
Thanks Devon. I had GTF and produced the lengths with your codes and the output was similar to above mentioned codes if it is changed as below. Just in case some one else is reading this and interested for a shorter way.