Hello everyone I am trying to write a code in R to calculate RPKM. (I am totally new in R, so having problems). For example i have the following txt file with length as last column,
Geneid Rep1 Rep2 gene Length
COMMD10 3 4 13
USP26 4 6 14
DDX17 6 7 20
and using the following formula to calculate RPKM
RPKM= 10^9 * number of mapped reads in gene
----------------------------------------------------
total reads * gene length
R code:
count <- read.delim("sample_count.txt", row.names=1, stringsAsFactors=FALSE)
head(count)
for (i in 1:2) {
count[,i] <- (count[,i]*10^9) / sum(count[,i]) #(count[,i]*10^9) is 10^9 * number of mapped reads in gene
#sum(count[,i]) is total reads
}
I cant handle the gene length in last column of count table. Can anyone please help me with this code to get the correct answer?
Thankyou so much for the reply... I made some changes in the code as it was giving some problem
this do give me a result .. but i think is not correct.. 1st colSums(d) donot shows any result and gives error Error: could not find function "colsums". 2nd Are you using thje rite formula to calculate RPKM? shouldnt that be rpkm <- (10^9d)(l*cS)
Kindly do reply
Also note RPKM should count the kilobases of exonic sequence, the length of the spliced transcript, NOT the distance genomically from the transcription start-site to the end. This is more tricky to calculate, but it looks like that data is given in your input table.
I guess you already set the rownames, so you can ignore that part. The error that you received was because you typed
colsums
rather thancolSums
, don't forget the capital S. Whether one uses 10^9 or 10^6 depends on whether the lengths in bp or kb. I noted in my original reply that I was assuming that your lengths were actually already in kb (otherwise, everything is really short), but if not then you should indeed use 10^9 instead. The difference of 1000 is just from the conversion of bp to kb.Edit: BTW, you can't multiply
l*cS
, one has length 2 and the other 3, so R will have no clue how you want to handle it. Also, you want division anyway, since it'sreads per kb per million mapped reads
(aka,reads/length.in.kb/mapped.reads.in.millions
).Hi Devon, I also trying to calculate RPKM from miRNA raw count. I tried your R code but i am getting error on this : cS <- colSums(d) #Total mapped reads per sample
my input looks like ( I hope you saw below in this conversation, don't mind) with miRNA_length included in the last coloum. I am new to R. So, will help me.
You presumably have a column of gene names. That won't work. You can figure out the rest from there.
Dear Devon,
Since you've mentioned
Edit: Note that the R code above isn't exactly correct, since the division won't occur properly by column.
can I just use the formula asrpkm <- (cs*10^9)/(d*l)
?Thank you.
No, you really just need to transpose some of the matrices:
t(t(d/l)/cS)
. I'll update my answer.