Question

Calculate RPKM

1

Entering edit mode

2.6 years ago

Chris ▴ 360

Hi bioinformaticians, would anyone calculate RPKM from the count matrix with edgeR or DESeq2?

I found some resources guide on this but there are one or two steps I don't know such as this:

y <- DGEList(counts=counts,genes=data.frame(Length=GeneLength)) 
y <- calcNormFactors(y) 
RPKM <- rpkm(y)

How to get GeneLength?

https://github.com/mgonzalezporta/TeachingMaterial/blob/master/doc/25.normalising.md

How to get annot object?

I appreciate your help!

RPKM • 2.8k views

ADD COMMENT • link 2.6 years ago by Chris ▴ 360

1

Entering edit mode

What organism would you like to get the gene length for? What is the source for the transcriptome you are using? Different organisms have different resources available for calculating gene length.

ADD REPLY • link 2.6 years ago by seidel 11k

0

Entering edit mode

Thank you for your reply. It is gene length for human. I am sorry for not being sure about the second question, in-house transcriptome.

ADD REPLY • link 2.6 years ago by Chris ▴ 360

1

Entering edit mode

Is the in-house transcriptome in GTF format? Do you know if you have multiple transcripts per gene? If your counts table reflects counts on genes, one way to get gene length, depending on how you've quantified reads across your transcriptome, would be to read the GTF file into R and create a TxDb object (using makeTxDbFromGFF), then you can extract out all exons by gene into a GRangesList, reduce them, and then sum their widths by gene. There are commands from the GRanges and GenomicFeatures libraries for each of these steps. This would give you a width representing the shadow of all exons for the gene on the genome (essentially all transcripts reduced). It's a bit of a fudge, as for a given locus with multiple transcripts it doesn't actually reflect any one transcript - it sort of represents a general bucket for the locus. In many cases we don't quantify individual isoforms of a locus, we quantify reads mapping to the locus overall. Just some caveats to keep in mind. Maybe others could suggest another approach.

ADD REPLY • link 2.6 years ago by seidel 11k

0

Entering edit mode

Thank you for your reply!

gene_length <- aggregate(width, list(gtf_exon$gene_name), FUN = sum)
y <- DGEList(counts=counts,genes=data.frame(Length=gene_length))

Error in DGEList(counts = counts, genes = data.frame(Length = gene_length)) : Counts and genes have different numbers of rows

Gene_length has 40192 observations with gene name meanwhile counts > 0 has 33863 observations with ensemple ID. Would you suggest a solution to this error?

ADD REPLY • link 2.6 years ago by Chris ▴ 360

1

Entering edit mode

If you're sure about your gene lengths, then simply make sure the counts matrix and gene_length matrix match each other. Assuming counts and gene_length have gene names as rownames:

# get the gene_lengths that match the counts matrix
gene_length <- gene_length[match(rownames(counts), rownames(gene_length)),]
# check the results
all(rownames(counts) == rownames(gene_length))

(modify the code as required, I hope the idea is clear)

ADD REPLY • link 2.6 years ago by seidel 11k

0

Entering edit mode

Thank you for your suggestion! Before using your code, I convert ensembl ID in counts matrix to gene name in gene length table:

raw_counts$symbol <- mapIds(org.Hs.eg.db, keys = rownames(raw_counts), keytype = "ENSEMBL", column = "SYMBOL")

'select()' returned 1:many mapping between keys and columns

Would you tell me how to fix the error above? I found this error here:

How to deal with `'select()' returned 1:many mapping between keys and columns` warning when retrieving Entrez IDs using Gene Symbols in R?

However this post is not clear to me to figure out how. I appreciate your help.

Update: I found a solution:

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("biomaRt")
library(biomaRt)
your_data_frame <- as.data.frame(counts) 
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl")) 
gene_IDs <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id","hgnc_symbol"), values = rownames(counts), mart= mart)

your_data_frame$GeneName <- gene_IDs[match(rownames(your_data_frame),gene_IDs[,1]),2]

The code is longer but it doesn't have error.

I try to replace the first column with Ensembl ID with the gene symbol. However, I can't get the index of the first column. The first indexing is the count of the first replicate. Would you suggest how to do so? Thank you so much!

Update: I figured it out.

ADD REPLY • link 2.6 years ago by Chris ▴ 360

0

Entering edit mode

I got gene_length and counts matrix with the same number of row with your suggestion.

y <-DGEList(counts=your_data_frame,genes=data.frame(Length=gene_length))

Error in DGEList(counts = your_data_frame, genes = data.frame(Length = gene_length)) :The count matrix is a data.frame instead of a matrix and the first column is of class characterinstead of being numeric. Was the first column intended to contain geneids?

I converted row names to column. How can I convert back to overcome the error back? Like this: https://www.geeksforgeeks.org/convert-values-in-column-into-row-names-of-dataframe-in-r/

I tried their solution but it didn't work.

ADD REPLY • link 2.6 years ago by Chris ▴ 360

1

Entering edit mode

the easiest way is to probably take the numeric columns of your data frame and create a matrix. For instance, if column 1 is gene names, and columns 2:10 are your count values, you could do:

foo <- as.matrix(your_data_frame[,2:10])
rownames(foo) <- your_data_frame[,1]

alternatively, if your_data_frame has gene ids as rownames:

foo <- as.matrix(your_data_frame[,2:10])
rownames(foo) <- rownames(your_data_frame)

Lastly, you can also omit columns with negative subscripts, so if column 1 is gene names:

foo <- as.matrix(your_data_frame[,-1])
rownames(foo) <- rownames(your_data_frame)

Then use foo in as your count matrix.

ADD REPLY • link 2.6 years ago by seidel 11k

0

Entering edit mode

Thank you so much! I figured it out a few hours ago. Could I ask you questions about ATAC-seq?

ADD REPLY • link 2.6 years ago by Chris ▴ 360

1

Entering edit mode

You're welcome to contact me, but you should just ask the forum overall with a new question. You'll likely get much better answers.