Question

How to link UCSC transcript IDs to protein IDs using Bioconductor?

11.2 years ago

Aurelie MLB ▴ 360

Hello,

I am new to this area so apologies if the answer is obvious.

I am currently using Bioconductor packages to access the UCSC genome and get the transcripts for my genes of interest. I would also like to link those transcripts to a protein if they are actually translated, but I could not find an easy way to do this with Bioconductor. I presume I could get the CDS and translate them, but I would like more than that, for instance access to the corresponding Ensembl IDs.

Would someone know how to do this please?

Many thanks

R genome gene sequence • 8.8k views

updated 2.5 years ago by Ram 45k • written 11.2 years ago by Aurelie MLB ▴ 360

Hi,

I have a related question. I noticed that through biomaRt I can only access the Homo sapiens Ensembl dataset "Homo sapiens genes (GRCh38.p2)". I also want to translate Ensembl transcript IDs into RefSeq IDs, but my Ensembl transcript IDs are from the GRCh37/hg19 build. Do you have any advice on how to retrieve these IDs through biomaRt, as in the example above, or on a better way to do it?

I'd appreciate your advice a lot!

updated 2.5 years ago by Ram 45k • written 10.3 years ago by ola.o4 • 0

Ram · Answer 1 · 2014-06-04

Your question doesn't really provide enough information, but maybe you're interested in the UCSC knownGene track for a model organism, for which there is already a Bioconductor package:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# From here you can discover available 'keytypes' and 'columns'
keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene)
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Extract all the transcript ids
txid = keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXID")

# and get their corresponding Entrez gene ids
df = select(TxDb.Hsapiens.UCSC.hg19.knownGene, txid, "GENEID", "TXID")

leading to

head(df)
  GENEID  TXID
1      1 70455
2      1 70456
3     10 31944
4    100 72132
5   1000 65378
6   1000 65379

If you wanted more information about the genes, you might use library(org.Hs.eg.db) and then

head(select(org.Hs.eg.db, df$GENEID, c("SYMBOL", "GENENAME")))
  ENTREZID SYMBOL                                              GENENAME
1        1   A1BG                                alpha-1-B glycoprotein
2        1   A1BG                                alpha-1-B glycoprotein
3       10   NAT2 N-acetyltransferase 2 (arylamine N-acetyltransferase)
4      100    ADA                                   adenosine deaminase
5     1000   CDH2             cadherin 2, type 1, N-cadherin (neuronal)
6     1000   CDH2             cadherin 2, type 1, N-cadherin (neuronal)
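The two results above can be combined into one transcript-level table. The following is a minimal base R sketch, not part of the original answer: it rebuilds the first rows shown above as toy data frames (the names `tx2gene`, `gene_info` and `annotated` are illustrative only) and joins them on the Entrez gene id.

```r
# Toy copies of the first rows of the two results shown above.
tx2gene <- data.frame(
  GENEID = c("1", "1", "10", "100", "1000", "1000"),
  TXID   = c(70455L, 70456L, 31944L, 72132L, 65378L, 65379L),
  stringsAsFactors = FALSE
)
gene_info <- data.frame(
  ENTREZID = c("1", "1", "10", "100", "1000", "1000"),
  SYMBOL   = c("A1BG", "A1BG", "NAT2", "ADA", "CDH2", "CDH2"),
  stringsAsFactors = FALSE
)

# The gene table has one row per input key, so genes with several
# transcripts appear more than once. Drop the duplicates before joining,
# otherwise the merge would multiply rows.
gene_info <- unique(gene_info)

# Join on the Entrez gene id: GENEID in one table, ENTREZID in the other.
annotated <- merge(tx2gene, gene_info, by.x = "GENEID", by.y = "ENTREZID")
annotated
```

With real data, `tx2gene` would be the `df` returned by select() on the TxDb object and `gene_info` the result of select() on org.Hs.eg.db.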

Also, BioMart is accessible through library(biomaRt). The package has a good vignette, available from the package landing page. See the introduction to Bioconductor annotation workflows for some additional information. If you're more specific about what your needs are, then other approaches might be possible.

For more general annotations, the biomaRt package is very handy. The idea is to discover the 'mart', 'dataset', 'filters' and 'attributes' available, via listMarts() etc., and then to compose a query

library(biomaRt)
## listMarts(), listDatasets("ensembl"), etc
mart <- useMart("ensembl", "hsapiens_gene_ensembl")
filters <- "ensembl_transcript_id"      # info I'll provide, see listFilters(mart)
attr <-                                 # info I want, ?listAttributes
    c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id") 
values = c("ENST00000275493", "ENST00000344576") # info I have

and then the query

getBM(attr, filters, values, mart)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493

An alternative to the final line, consistent with the use of select in other annotation resources, is

select(mart, values, attr, filters)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493

In truth I 'discovered' the relevant marts, datasets, etc., partly in R and partly by navigating the Ensembl BioMart website. Don't forget to check out the biomaRt vignette.
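For a long list of transcript IDs, the same query can be run in batches and the results stacked. This is only a sketch under stated assumptions: `chunk_ids()` and `map_ids()` are hypothetical helper names, the batch size of 500 is arbitrary, and `map_ids()` needs the biomaRt package and network access when it is actually called.

```r
# Split a vector of IDs into batches of at most `size` elements (base R only).
chunk_ids <- function(ids, size = 500) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Hypothetical helper: run one getBM() query per batch and stack the results.
# `mart` is a Mart object such as the one created with useMart() above.
map_ids <- function(ids, mart, filter = "ensembl_transcript_id", size = 500) {
  attrs <- c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id")
  batches <- chunk_ids(unique(ids), size)
  results <- lapply(batches, function(batch) {
    biomaRt::getBM(attributes = attrs, filters = filter,
                   values = batch, mart = mart)
  })
  # unname() keeps rbind() from prefixing row names with the batch index.
  do.call(rbind, unname(results))
}

# Example of the batching step alone, which runs without any network access.
lengths(chunk_ids(letters, 10))
```

A call would look like `map_ids(values, mart)`. To start from UCSC transcript IDs instead, the `filter` argument could be switched to a UCSC filter, assuming the dataset exposes one; the exact name should be checked with listFilters(mart) rather than taken on trust.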

Ram · Answer 2 · 2014-06-04

11.2 years ago

Kizuna ▴ 880

If you do not have many transcript IDs, you can use BioMart (Ensembl): http://www.ensembl.org/biomart/martview/b3b87cd3b220cf9d6d08d7de1a51fadd

You can also find easy tutorials for this tool :)

Hope it helps

updated 5.6 years ago by Ram 45k • written 11.2 years ago by Kizuna ▴ 880

Hi Kizuna, thanks a lot! The thing is, I do have quite a few, so I would like to automate this. This is why I was interested in Bioconductor.

written 11.2 years ago by Aurelie MLB ▴ 360