13 months ago
Chris
Hi Biostars,
I have a count matrix with mouse gene names as row names and need to compute RPKM. I know it is not a good metric, but the biologists are used to it.
library(rtracklayer)  # readGFF
library(edgeR)        # DGEList, calcNormFactors, rpkm
gtf <- readGFF("/reference_genome/mm39.ncbiRefSeq.gtf")
gtf_exon <- gtf[gtf$type == "exon", ]
# Caveat: summing every exon row counts exons shared between transcripts more than once
width <- gtf_exon$end - gtf_exon$start + 1
gene_length <- aggregate(width, list(gtf_exon$gene_name), FUN = sum)
colnames(gene_length) <- c("gene_name", "gene_length")
# Reorder lengths to match the count matrix rows (NA where a gene is missing from the GTF)
gene_length <- gene_length$gene_length[match(rownames(counts_matrix), gene_length$gene_name)]
y <- DGEList(counts = counts_matrix, genes = data.frame(Length = gene_length))
y <- calcNormFactors(y)
RPKM <- rpkm(y)
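Summing the width of every exon row can overestimate gene length, because an exon shared by several transcripts appears once per transcript in the GTF. A minimal base R sketch of one alternative, taking the length of the union of exons per gene, is below. The exon table and the helper `union_length` are hypothetical and only illustrate the idea on toy coordinates; this is not necessarily the length definition any particular pipeline uses.

```r
# Toy exon table (hypothetical values): GeneA has two overlapping exons
# that come from different transcripts
exons <- data.frame(
  gene_name = c("GeneA", "GeneA", "GeneA", "GeneB"),
  start     = c(100, 150, 400, 10),
  end       = c(200, 250, 500, 109)
)

# Length of the union of closed intervals [start, end] (hypothetical helper)
union_length <- function(start, end) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end + 1) {
      # Overlapping or directly adjacent: extend the current block
      cur_end <- max(cur_end, end[i])
    } else {
      # Gap: close the current block and start a new one
      total <- total + (cur_end - cur_start + 1)
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# Union length per gene versus the naive sum of all exon widths
merged_length <- sapply(split(exons, exons$gene_name),
                        function(g) union_length(g$start, g$end))
naive_length <- tapply(exons$end - exons$start + 1, exons$gene_name, sum)
```

In this toy table the naive sum for GeneA is 303 bp, while the merged length is 252 bp, because the first two exons overlap. On a real GTF the same idea would be applied per gene within each chromosome and strand.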
I looked for a GTF file to get the gene lengths, but none of the GTF files I found seem to be in gene name format. Would you please have a suggestion? Thank you so much! https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/genes/
Update: the many genes with names like 1700012P22Rik at the beginning of the matrix make me think it is not in gene symbol format.
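One way to test that suspicion is to count how many row names of the count matrix actually appear in the GTF's `gene_name` column. The sketch below uses toy vectors as hypothetical stand-ins for `rownames(counts_matrix)` and `unique(gtf$gene_name)`; the identifiers in them are illustrative and not taken from the real files.

```r
# Hypothetical stand-ins for rownames(counts_matrix) and unique(gtf$gene_name)
count_ids <- c("1700012P22Rik", "Actb", "Gapdh", "UnknownGene1")
gtf_names <- c("Actb", "Gapdh", "1700012P22Rik", "Xkr4")

in_gtf <- count_ids %in% gtf_names
match_rate <- mean(in_gtf)        # fraction of count-matrix IDs found in the GTF
unmatched <- count_ids[!in_gtf]   # IDs that need a closer look
```

A match rate close to 1 would suggest the two naming schemes agree, and the `unmatched` vector shows which identifiers to inspect by hand. Names ending in "Rik" are commonly RIKEN cDNA names that are also used as mouse gene symbols, so they do not by themselves rule out symbol format.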