Hello everyone
I am trying to write a code in R to calculate RPKM. (I am totally new in R, so having problems). For example i have the following txt file with length as last column,
RPKM= 10^9 * number of mapped reads in gene
----------------------------------------------------
total reads * gene length
R code:
count <- read.delim("sample_count.txt", row.names=1, stringsAsFactors=FALSE)
head(count)
for (i in 1:2) {
count[,i] <- (count[,i]*10^9) / sum(count[,i]) #(count[,i]*10^9) is 10^9 * number of mapped reads in gene
#sum(count[,i]) is total reads
}
I cant handle the gene length in last column of count table. Can anyone please help me with this code to get the correct answer?
Firstly, it's best to avoid for loops in R, particularly since you don't need them here. If the dataframe in your example were namedd:
row.names(d) <- d$Geneid
l <- d[,4]
d <- d[,-c(1,4)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample
rpkm <- (10^6)*t(t(d/l)/cS)
I'm assuming the lengths are already in kb, so there's a 10^6 multiplier instead of 10^9.
Edit: The matrix transposition in the last step are due to cS needing to be applied to each column, rather than sequentially across the column-order-stored matrix.
Thankyou so much for the reply... I made some changes in the code as it was giving some problem
row.names(d) <- d$Geneid
l <- d[,3] #accessing the length column
d <- d[,-c(3)] #Remove the Geneid and length columns, they get in the way
cS <- colSums(d) #Total mapped reads per sample
rpkm <- (10^9)*(d/l/cS)
this do give me a result .. but i think is not correct.. 1st colSums(d) donot shows any result and gives error Error: could not find function "colsums". 2nd Are you using thje rite formula to calculate RPKM? shouldnt that be rpkm <- (10^9d)(l*cS)
Also note RPKM should count the kilobases of exonic sequence, the length of the spliced transcript, NOT the distance genomically from the transcription start-site to the end. This is more tricky to calculate, but it looks like that data is given in your input table.
I guess you already set the rownames, so you can ignore that part. The error that you received was because you typed colsums rather than colSums, don't forget the capital S. Whether one uses 10^9 or 10^6 depends on whether the lengths in bp or kb. I noted in my original reply that I was assuming that your lengths were actually already in kb (otherwise, everything is really short), but if not then you should indeed use 10^9 instead. The difference of 1000 is just from the conversion of bp to kb.
Edit: BTW, you can't multiply l*cS, one has length 2 and the other 3, so R will have no clue how you want to handle it. Also, you want division anyway, since it's reads per kb per million mapped reads (aka, reads/length.in.kb/mapped.reads.in.millions).
Hi Devon,
I also trying to calculate RPKM from miRNA raw count. I tried your R code but i am getting error on this :
cS <- colSums(d) #Total mapped reads per sample
my input looks like ( I hope you saw below in this conversation, don't mind) with miRNA_length included in the last coloum. I am new to R. So, will help me.
Since you've mentioned Edit: Note that the R code above isn't exactly correct, since the division won't occur properly by column. can I just use the formula as rpkm <- (cs*10^9)/(d*l)?
I am impressed on your R code for RPKM calculation. Actually am working in 7 same tissue miRNA-seq data. I have raw counts for all miRNAsfrom 7 samples from featureCounts. Now my input files is looks like
Don't compute RPKM unless you have a compelling reason to do so. If you need to generate gene lengths then get the needed information from biomart or your GTF file.
Thankyou so much for the reply... I made some changes in the code as it was giving some problem
this do give me a result .. but i think is not correct.. 1st colSums(d) donot shows any result and gives error Error: could not find function "colsums". 2nd Are you using thje rite formula to calculate RPKM? shouldnt that be rpkm <- (10^9d)(l*cS)
Kindly do reply
Also note RPKM should count the kilobases of exonic sequence, the length of the spliced transcript, NOT the distance genomically from the transcription start-site to the end. This is more tricky to calculate, but it looks like that data is given in your input table.
I guess you already set the rownames, so you can ignore that part. The error that you received was because you typed
colsums
rather thancolSums
, don't forget the capital S. Whether one uses 10^9 or 10^6 depends on whether the lengths in bp or kb. I noted in my original reply that I was assuming that your lengths were actually already in kb (otherwise, everything is really short), but if not then you should indeed use 10^9 instead. The difference of 1000 is just from the conversion of bp to kb.Edit: BTW, you can't multiply
l*cS
, one has length 2 and the other 3, so R will have no clue how you want to handle it. Also, you want division anyway, since it'sreads per kb per million mapped reads
(aka,reads/length.in.kb/mapped.reads.in.millions
).Hi Devon, I also trying to calculate RPKM from miRNA raw count. I tried your R code but i am getting error on this : cS <- colSums(d) #Total mapped reads per sample
my input looks like ( I hope you saw below in this conversation, don't mind) with miRNA_length included in the last coloum. I am new to R. So, will help me.
You presumably have a column of gene names. That won't work. You can figure out the rest from there.
Dear Devon,
Since you've mentioned
Edit: Note that the R code above isn't exactly correct, since the division won't occur properly by column.
can I just use the formula asrpkm <- (cs*10^9)/(d*l)
?Thank you.
No, you really just need to transpose some of the matrices:
t(t(d/l)/cS)
. I'll update my answer.