I just want to make it clear.
I need to calculate FPKM. I use this formula: Normalized = [(raw_read_count)(10^9)] / [(gene_length)(XXXX)],
XXXX = the count of all reads that are aligned to protein-coding genes in that alignment.
How should I calculate XXXX? Is it just sum of all raw_read_counts after htseq-count (e.g. in R it will be XXXX <- sum(collumn_with_raw_red_counts)?
Thanks!
Try this solution,
PERL solution:
https://github.com/santhilalsubhash/rpkm_rnaseq_count
R solution:
A: How to normalise read count per gene
Thanks, but the question is about how to get Total count of all reads. Is it just sum of all counts? I ask it because I am trying to duplicate the results on GDC data portal and I can't do it because their Total count of all reads smaller than mine (which I calculate by sum())
Yes. Sum of all counts from individual samples (library size).
It's "fragments per kilobase transcript per million reads", so you should divide by million, not multiply.
That said, are you sure FPKMs is something you need? I don't know your application, but for many purposes, there are better normalisation methods.
This is formula with effective length, you can see why I multiply by 10^9, not devide.
I am trying to duplicate the results on GDC data portal.
Ugh, you're totally right. Shame on me!
Can you clarify
duplicate
part? Unless you are using the exact versions of software/genome build that may not be realistically possible.I am using exact versions software and genome like TCGA. And everything is good (I mean results from TCGA and my results after mRNA pipeline are identic. Except for N value for calculating FPKM (the count of all reads that are aligned to protein-coding genes in that alignment)
Is it possible that for paired end reads (fragments) they divided total reads by 2 ? (I know it is silly to think like that but can you match your numbers divided by two with TCGA total library sizes)
OR
While calculating FPKM (raw_reads/2) per gene.
Thanks for your guess, but it doesn't work that way. My N values differs by 0.03x - 0.08x (x = TCGA N values).
Aligners may produce non-deterministic output (unless they are able to accept a seed). Perhaps that is what is causing this difference.