How can I get the gene length from a list of Ensembl IDs using biomaRt, for instance (or any R-based method that does not require downloading a separate annotation file first)?
Since the gene_length attribute does not exist in biomaRt, is there a better alternative than using the start_position and end_position attributes and then subtracting the two values, as follows:
library(biomaRt)
ensembl_list <- c("ENSG00000000003","ENSG00000000419","ENSG00000000457","ENSG00000000460")
human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
# The filter name is "ensembl_gene_id" (the original "ensembl_gene_idl" was a typo)
start_pos <- getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_id", values = ensembl_list, mart = human, attributesL = "start_position", martL = human, uniqueRows = TRUE)
end_pos <- getLDS(attributes = "ensembl_gene_id", filters = "ensembl_gene_id", values = ensembl_list, mart = human, attributesL = "end_position", martL = human, uniqueRows = TRUE)
gene_L <- merge(start_pos, end_pos, by.x="Gene.stable.ID", by.y="Gene.stable.ID")
# Ensembl coordinates are 1-based and inclusive, hence the + 1
gene_L$Length <- gene_L$Gene.end..bp. - gene_L$Gene.start..bp. + 1
end_position - start_position is potentially wrong due to splicing. Introns should probably not get counted, although you did not explain your application.
You are right. I have a numeric gene expression matrix (in CPM) that I want to convert into FPKM. That's why I was looking for a way to get gene length. So do you think taking introns into account would matter here?
Absolutely.
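To make the CPM to FPKM step concrete, here is a minimal sketch. It assumes you already have one length per gene in bp (ideally an exonic length rather than end minus start); the matrix values and the lengths below are made up for illustration, and the object names are hypothetical. Since CPM is counts per million reads, FPKM is just CPM divided by the gene length in kb.

```r
# Toy CPM matrix (genes x samples); values are made up for illustration.
cpm <- matrix(c(10, 20,
                 5, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("ENSG00000000003", "ENSG00000000419"),
                              c("sample1", "sample2")))

# Hypothetical exonic lengths in bp, named by gene ID (not real values).
exonic_length_bp <- c(ENSG00000000419 = 500, ENSG00000000003 = 2000)

# Align the lengths to the matrix rows by gene ID before dividing.
len <- exonic_length_bp[rownames(cpm)]
stopifnot(!anyNA(len))

# FPKM = CPM / (length in kb). The vector is recycled down each column,
# so every row (gene) is divided by its own length.
fpkm <- cpm / (len / 1000)
fpkm
```

Matching by rownames matters here: if the length vector is in a different order than the matrix, a plain division would silently pair genes with the wrong lengths.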
The EDASeq getGeneLengthAndGCContent function indeed uses exons (see line 109 of the code here). Because the CPM matrix was generated with HTSeq-count, I think I should use EDASeq and skip the introns, no? Just to be consistent with the same method.
My understanding of a gene is represented by this picture: https://upload.wikimedia.org/wikipedia/commons/5/54/Gene_structure_eukaryote_2_annotated.svg (especially the DNA part). I guess the OP's requirement is the total length of exons (all possible exons). This is different from gene length (IMO). Gene length from NCBI/Ensembl at least covers all known transcripts of the gene. The gene length calculated by EDASeq doesn't make sense to me, especially calling it gene length. So please take whatever is suitable for your analysis. Code is provided for either case.
EDASeq is overkill for this task if someone is interested only in gene lengths. Here are a few alternatives:
https://bioinformatics.stackexchange.com/questions/4942/finding-gene-length-using-ensembl-id
See especially the answer regarding the GenomicFeatures library.
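For readers who want to see what "total exon length" means in practice, here is a small base-R sketch of the underlying idea: merge the overlapping exons from all transcripts of a gene, then sum the merged widths. The function name union_exon_length and the coordinates are made up for illustration. On a real annotation, my understanding is that the usual GenomicFeatures route is exonsBy() on a TxDb followed by reduce() and width(), which does the same merging per gene; treat that as a pointer rather than a tested recipe.

```r
# Length of the union of 1-based, inclusive intervals belonging to one gene.
# Overlapping or directly adjacent exons are merged so no base is counted twice.
union_exon_length <- function(starts, ends) {
  stopifnot(length(starts) == length(ends), all(ends >= starts))
  if (length(starts) == 0) return(0)
  ord <- order(starts)
  starts <- starts[ord]
  ends <- ends[ord]
  total <- 0
  cur_start <- starts[1]
  cur_end <- ends[1]
  for (i in seq_along(starts)[-1]) {
    if (starts[i] <= cur_end + 1) {
      # Overlaps or touches the current block: extend it.
      cur_end <- max(cur_end, ends[i])
    } else {
      # Gap (an intron): close the current block and start a new one.
      total <- total + (cur_end - cur_start + 1)
      cur_start <- starts[i]
      cur_end <- ends[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# Toy exons (made-up coordinates) pooled from two transcripts of one gene.
exon_start <- c(100, 150, 400, 400)
exon_end   <- c(200, 250, 500, 450)

exonic_len <- union_exon_length(exon_start, exon_end)  # merged exons only
span_len   <- max(exon_end) - min(exon_start) + 1      # start-to-end span, introns included
```

In this toy case the merged blocks are 100-250 and 400-500, so the exonic length is 252 bp while the start-to-end span is 401 bp. That difference is what the discussion above about introns is pointing at.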