I have a raw RNA expression data frame with genes as rows (HUGO gene names) and samples as columns (Homo sapiens research). I want to add another column containing the length of each gene, in order to conduct TPM normalization (gene length is needed in the formula).
I'm familiar with one way to get gene lengths, which is the goseq package:
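To make the role of gene length concrete, here is a minimal base R sketch of the usual TPM calculation: divide counts by gene length in kilobases, then scale each sample so it sums to one million. The toy counts, the lengths, and the helper name `counts_to_tpm` are made up for illustration and are not from any package.

```r
# Toy raw counts: genes as rows, samples as columns (values are made up).
counts <- matrix(c(10, 20, 30,
                   40, 50, 60),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Hypothetical gene lengths in base pairs, in the same order as the rows.
gene_length_bp <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# Hypothetical helper: convert raw counts to TPM.
counts_to_tpm <- function(counts, gene_length_bp) {
  counts <- as.matrix(counts)
  stopifnot(nrow(counts) == length(gene_length_bp))
  # Reads per kilobase: each row is divided by its own gene length,
  # because the vector is recycled down the columns.
  rate <- counts / (gene_length_bp / 1000)
  # Scale every sample (column) so that it sums to one million.
  t(t(rate) / colSums(rate)) * 1e6
}

tpm <- counts_to_tpm(counts, gene_length_bp)
colSums(tpm)  # each sample sums to 1e6
```

Genes with a missing length would end up as NA here, which is why I need lengths for as many genes as possible.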
length <- goseq::getlength(gene_names, 'hg19', 'geneSymbol')
Unfortunately, this package does not support the latest assembly, hg38, so many of my genes are not supported and have no lengths. I don't want to lose that much information: out of 20000 genes I get only 15000 lengths.
After a quick search I found another way, using biomaRt and EDASeq::getGeneLengthAndGCContent, but I don't understand how to use it or which annotations it expects.
I could really use some help with this function, or with any other approach you might suggest.
Thanks!
Past thread that may be useful:
gene length for calculating TPM values
How should I use the exonsBy function? I don't have a TxDb object. I tried calling it on my vector of gene names, but I don't think that's the right way; I'm missing something here.
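For context, this is a rough sketch of how exonsBy is usually called, as far as I can tell. It assumes the Bioconductor packages GenomicFeatures, TxDb.Hsapiens.UCSC.hg38.knownGene, and org.Hs.eg.db are installed, that the TxDb is keyed by Entrez gene IDs (so a mapping to symbols is needed), and that "gene length" means the total width of the merged exons. The `gene_names` vector is a hypothetical stand-in for my own row names.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)

# exonsBy takes a TxDb object (an annotation database), not gene names.
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
exons_by_gene <- exonsBy(txdb, by = "gene")

# Merge overlapping exons within each gene, then sum their widths.
# This gives one length per gene; names are Entrez gene IDs (assumption).
gene_lengths <- sum(width(reduce(exons_by_gene)))

# Map Entrez IDs to gene symbols so the lengths can be matched to my rows.
symbols <- AnnotationDbi::mapIds(org.Hs.eg.db,
                                 keys = names(gene_lengths),
                                 column = "SYMBOL",
                                 keytype = "ENTREZID",
                                 multiVals = "first")
length_by_symbol <- setNames(gene_lengths, symbols)
length_by_symbol <- length_by_symbol[!is.na(names(length_by_symbol))]

# Hypothetical stand-in for the row names of my expression data frame.
gene_names <- c("TP53", "BRCA1", "GAPDH")
gene_length_bp <- length_by_symbol[gene_names]
```

Symbols that are missing from the annotation come back as NA, so it should be easy to count how many genes are still lost with this approach.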