Question

Calculating fpkm from raw counts

0

Entering edit mode

6.6 years ago

vinayjrao ▴ 250

Hello,

I have a file consisting of raw counts. For my analysis, I want to normalize the counts by fpkm. The question is, how can I calculate fpkm from raw counts?

Also, upon a little reading I found that DESeq2 could do i, but it requires the file to be in DESeq format. Could someone guide me on what the format should look like?

Thanks in advance.

Edit: Thank you for pointing it out. Values of raw counts changed! (I had mistakenly copied the rpkm values)

P.S. Below is an example of my raw counts file -

gene      sample 1      sample 2     sample 3

A1BG    7589      8458      7945

A1CF    513      718      748

A2M     7648      8998      9778

RNA-Seq R fpkm • 23k views

ADD COMMENT • link updated 3.2 years ago by TheVavish ▴ 10 • written 6.6 years ago by vinayjrao ▴ 250

1

Entering edit mode

Relevant link: https://bioinformatics.stackexchange.com/questions/367/how-exactly-is-effective-length-used-in-fpkm-calculated

ADD REPLY • link 6.6 years ago by zx8754 12k

score 3 · Answer 1 · 2021-04-02

3

Entering edit mode

3.7 years ago

Ahmed Alhendi ▴ 240

You can use countToFPKM package. This package provides an easy to use function to convert the read count matrix into FPKM matrix; following the equation in $enter image description here$

The fpkm() function requires three inputs to return FPKM as numeric matrix normalized by library size and feature length:

counts A numeric matrix of raw feature counts.
featureLength A numeric vector with feature lengths that can be obtained using
biomaRt.
meanFragmentLength A numeric vector with mean fragment lengths,
which can be calculate with Picard using CollectInsertSizeMetrics.

See https://github.com/AAlhendi1707/countToFPKM

ADD COMMENT • link 3.7 years ago by Ahmed Alhendi ▴ 240

1

Entering edit mode

Can I use MEAN_INSERT_SIZE from picard as the meanFragmentLength directly ? Should I add the length of the adapter sequence that I used for trimming to the MEAN_INSERT_SIZE ? Please clarify on this.

ADD REPLY • link 3.2 years ago by TheVavish ▴ 10

score 0 · Answer 2 · 2018-04-30

0

Entering edit mode

6.6 years ago

caggtaagtat ★ 1.9k

Hi vina,

This is a quiet helpful article about normalized counts.

According to it, FPKM are calcualted like this:

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM (or FPKM):

1)Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.

2)Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)

3)Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

However it is advised to use TPM instead. I think it depends on what you are planning to do.

Nevertheless, it seems to me like your data is already somewhat normalized, since your raw gene counts are not integer numbers.

ADD COMMENT • link 6.6 years ago by caggtaagtat ★ 1.9k

0

Entering edit mode

Thanks for the reply. I have made an edit in the sample table (corrected a mistake). I have gone through the article, but it just mentions that the difference between rpkm and fpkm is that rpkm is for single-end data, while fpkm is for paired-end data.

Also, under what condition is TPM advisable?

I want to normalize the gene expression and classify the samples (1098 samples) into different molecular subtypes for breast cancer, and then study the expression of certain genes across the subtypes to check whether they have a conserved pattern of expression across subtypes.

ADD REPLY • link 6.6 years ago by vinayjrao ▴ 250

1

Entering edit mode

Ok, so for manually comparing expression of the same genes within the same organism, I would always use TPM. However, I never did this manually, but always with the DESeq2 package of R since it does the normalization for you and comes with various additonal features.

Therefore, I would definitively give DESeq2 a try!

You mentioned, that your data is not in the right format. I think you could change that by transforming your column "gene" to rownames in R like that:

rownames(dataframe) <- dataframe$gene
dataframe$gene <- NULL

Now you have a legit count matrix wich you can use to create a DESeq dataset with the DESeq2 function DESeqDataSetFromMatrix. However you will need an additional dataframe first, which gives DESeq information about your samples, its called coldata in this case.

The very basic structur of a coldata object can be created somehow like this:

coldata <- data.frame(sample_names = c("sample1", "sample2", "sample3"), molecular_breastcancer_subtype = c("Subtype_A", "Subtype_B", "Subtype_A"))

rownames(coldata) <- coldata$sample_names
coldata$sample_names <- NULL

Be aware, that the order of your sampels in the coldata object matches the order of your samples in the countmatrix. Now you use the count matrix and the coldata dataframe to generate a DESeq2 object

dds <- DESeqDataSetFromMatrix(countData = dataframe,
                              colData = coldata,
                              design = ~ molecular_breastcancer_subtype)

You could also additonally consider multiple factors, than just "molecular_breastcancer_subtype" in you "design" parameter, if you entered the information in the coldata dataframe.

There are a few nice workflows on the bioconductor page like this

ADD REPLY • link 6.6 years ago by caggtaagtat ★ 1.9k

0

Entering edit mode

Thank you very much. I'll give it a try, but it certainly helped me understand better.

Edit: I considered your suggestion of taking TPM into account, and found that indeed it would be a better method to proceed with my analysis. Thank you.

ADD REPLY • link 6.6 years ago by vinayjrao ▴ 250