Question

Mapping From Gene Names To Gene Lengths In R?

2

Entering edit mode

11.8 years ago

John St. John ★ 1.2k

I have a list of genes, and their ENTREZ and HUGO Ids, along with raw counts of RNA-seq reads mapping to those genes. I want to come up with a quick approximation of either RPKM or TPM.

To do that all I am missing is the length of each gene. Is there a quick way to get at this information in R? Maybe something like mapping ENTREZ to RefSeq ID, and then grabbing the length of whatever the first RefSeq transcript happens to be for each gene?

Thanks!

r gene length entrez • 11k views

ADD COMMENT • link updated 7.0 years ago by Biostar 20 • written 11.8 years ago by John St. John ★ 1.2k

score 9 · Answer 1 · 2013-02-01

Quick and dirty total exonic lengths, leveraging GenomeFeatures and the Hs org package:

  hg19GeneLengths <- function(symbols)
    {
        require(TxDb.Hsapiens.UCSC.hg19.knownGene) 
        require(org.Hs.eg.db)
        exons.db = exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by='gene')    
        egs    = unlist(  mget(symbols[ symbols %in% keys(org.Hs.egSYMBOL2EG) ],org.Hs.egSYMBOL2EG) )
        sapply(egs,function(eg)
        {
            exons = exons.db[[eg]]
            exons = reduce(exons)
            sum( width(exons) )
        })
    }
    hg19GeneLengths( c('STAT1','CXCL10','ACTB','PDCD1') )


 STAT1 CXCL10   ACTB  PDCD1 
  4943   1216   2253   2112

score 2 · Answer 2 · 2013-02-01

John,

You could consider rentrez, which is a pretty low-level R library for the entrez API (which I wrote!).

In this case you would want to find links to other databases from your gene IDs, and then get the summary file for the refseq record:

library(rentrez)

links <- entrez_link(db = "all", ids = 353174, dbfrom = "gene")
names(links)


##  [1] "gene_bioproject"            "gene_biosystems_all"       
##  [3] "gene_cdd"                   "gene_dbvar"                
##  [5] "gene_genome"                "gene_geoprofiles"          
##  [7] "gene_gtr"                   "gene_homologene"           
##  [9] "gene_nuccore"               "gene_nuccore_mgc"          
## [11] "gene_nuccore_pos"           "gene_nuccore_refseqrna"    
## [13] "gene_nucest"                "gene_nucest_clust"         
## [15] "gene_nucleotide"            "gene_nucleotide_clust"     
## [17] "gene_nucleotide_mgc"        "gene_nucleotide_mgc_url"   
## [19] "gene_nucleotide_pos"        "gene_omim"                 
## [21] "gene_pcassay"               "gene_pcassay_proteintarget"
## [23] "gene_pccompound"            "gene_pcsubstance"          
## [25] "gene_pmc"                   "gene_probe"                
## [27] "gene_protein"               "gene_protein_refseq"       
## [29] "gene_pubmed"                "gene_pubmed_citedinomim"   
## [31] "gene_pubmed_rif"            "gene_snp"                  
## [33] "gene_snp_genegenotype"      "gene_taxonomy"             
## [35] "gene_unigene"               "gene_unists"               
## [37] "file"

Once you have ID for the refseq record, you can fetch an (unparsed) XML file with entrez_summary:

e_sum <- entrez_summary(db = "nuccore", id = links$gene_nuccore_refseqrna)
print(e_sum)

## 
## http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
## <eSummaryResult>
##   <DocSum>
##     <Id>206725455</Id>
##     <Item Name="Caption" Type="String">NM_180990</Item>
##     <Item Name="Title" Type="String">Homo sapiens zinc activated ligand-gated ion channel (ZACN), mRNA</Item>
##     <Item Name="Extra" Type="String">gi|206725455|ref|NM_180990.3|[206725455]</Item>
##     <Item Name="Gi" Type="Integer">206725455</Item>
##     <Item Name="CreateDate" Type="String">2003/05/15</Item>
##     <Item Name="UpdateDate" Type="String">2012/11/11</Item>
##     <Item Name="Flags" Type="Integer">512</Item>
##     <Item Name="TaxId" Type="Integer">9606</Item>
##     <Item Name="Length" Type="Integer">1473</Item>
##     <Item Name="Status" Type="String">live</Item>
##     <Item Name="ReplacedBy" Type="String"/>
##     <Item Name="Comment" Type="String"></Item>
##   </DocSum>
## </eSummaryResult>
##

Presmusing this is the process you would go through, then wrapping it up in a function (and using Xpath to extract information from the XML) is straightforward:

get_entrez_length <- function(e_id) {
    links <- entrez_link(db = "nuccore", ids = e_id, dbfrom = "gene")
    summary_xml <- entrez_summary(db = "nuccore", id = links$gene_nuccore_refseqrna)
    len <- xpathSApply(summary_xml, "//Item[@Name='Length']", xmlValue)
    return(as.integer(len))
}

get_entrez_length(353174)

## [1] 1473

If you do use rentrez and have any questions/suggestions/bugs let me know - still developing it so happy to make it more useful.

score 0 · Answer 3 · 2013-02-01

Hey John :D

I don't have an exact solution for you, but maybe this information would help (wouldn't necessarily be quick though):

It looks like someone was coding up an R interface to eutils: http://code.google.com/p/reutils/source/browse/branches/lite/eutils.r?r=24 . I looks to me like esearch and esummary are functional enough to take queries into an XML tree. Might be more of a hassle than its worth, though.

score 0 · Answer 4 · 2013-02-01

0

Entering edit mode

11.8 years ago

Irsan ★ 7.8k

biomaRt?

ADD COMMENT • link 11.8 years ago by Irsan ★ 7.8k