How to link UCSC transcript IDs to protein IDs using Bioconductor?
10.5 years ago
Aurelie MLB ▴ 360

Hello,

I am new to this area so apologies if the answer is obvious.

I am currently using Bioconductor packages to access the UCSC genome and get the transcripts for my genes of interest. I would also like to link those transcripts to a protein if they are actually translated, but I could not find an easy way to do this with Bioconductor. I presume I could get the CDS and translate them, but I would like more than that, for instance access to the corresponding Ensembl IDs.

Would someone know how to do this please?

Many thanks

R genome gene sequence • 8.0k views

Hi,

I have a related question. I noticed that through biomaRt I can only access the Homo sapiens Ensembl dataset "Homo sapiens genes (GRCh38.p2)". I also want to translate Ensembl transcript IDs into RefSeq IDs, but my Ensembl transcript IDs are from the GRCh37/hg19 build. Do you have any advice on how to retrieve these IDs through biomaRt, as in the example above, or on a better way to do it?

I'd greatly appreciate your advice!

10.5 years ago
Martin Morgan ★ 1.6k

Your question doesn't really provide enough information, but maybe you're interested in the UCSC knownGene track for a model organism, for which there is already a Bioconductor package:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# From here you can discover available 'keytypes' and 'columns'
keytypes(TxDb.Hsapiens.UCSC.hg19.knownGene)
columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Extract all the transcript ids
txid = keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXID")

# and get their corresponding Entrez gene ids
df = select(TxDb.Hsapiens.UCSC.hg19.knownGene, txid, "GENEID", "TXID")

leading to

head(df)
  GENEID  TXID
1      1 70455
2      1 70456
3     10 31944
4    100 72132
5   1000 65378
6   1000 65379

If you wanted more information about the genes, you might use library(org.Hs.eg.db) and then

head(select(org.Hs.eg.db, df$GENEID, c("SYMBOL", "GENENAME")))
  ENTREZID SYMBOL                                              GENENAME
1        1   A1BG                                alpha-1-B glycoprotein
2        1   A1BG                                alpha-1-B glycoprotein
3       10   NAT2 N-acetyltransferase 2 (arylamine N-acetyltransferase)
4      100    ADA                                   adenosine deaminase
5     1000   CDH2             cadherin 2, type 1, N-cadherin (neuronal)
6     1000   CDH2             cadherin 2, type 1, N-cadherin (neuronal)
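To keep transcript ids and gene symbols side by side, the two tables can be joined on the Entrez gene id. The following is a minimal base-R sketch on toy copies of the rows printed above; the object names tx2gene, gene_info and tx_annot are illustrative and not part of any package.

```r
# Toy copies of the rows shown in the two outputs above
tx2gene <- data.frame(
    GENEID = c("1", "1", "10", "100", "1000", "1000"),
    TXID   = c(70455L, 70456L, 31944L, 72132L, 65378L, 65379L),
    stringsAsFactors = FALSE
)
gene_info <- data.frame(
    ENTREZID = c("1", "10", "100", "1000"),
    SYMBOL   = c("A1BG", "NAT2", "ADA", "CDH2"),
    stringsAsFactors = FALSE
)

# Left join: keep every transcript, attach the symbol of its gene.
# Each gene appears once in gene_info, so no rows are duplicated.
tx_annot <- merge(tx2gene, gene_info,
                  by.x = "GENEID", by.y = "ENTREZID", all.x = TRUE)
print(tx_annot)
```

With the real objects, df would take the place of tx2gene and the de-duplicated result of select() the place of gene_info.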

Also, biomart is accessible through library(biomaRt). The package has a good vignette, available from the package landing page. See the introduction to Bioconductor annotation workflows for some additional information. If you're more specific about what your needs are, then other approaches might be possible.
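One such approach that stays offline, offered as a sketch rather than a definitive recipe: org.Hs.eg.db also has an ENSEMBLPROT column, so Entrez gene ids can be mapped to Ensembl protein ids without a web query. Note that this mapping is per gene, not per transcript, so it will not tell you which transcript encodes which protein. The four gene ids below are simply those from the head(df) output above.

```r
library(org.Hs.eg.db)   # also loads AnnotationDbi

# Entrez gene ids taken from the head(df) output above
gene_ids <- c("1", "10", "100", "1000")

# Being explicit about the keytype avoids relying on the default.
# AnnotationDbi::select is spelled out to avoid clashes with dplyr::select.
prot <- AnnotationDbi::select(org.Hs.eg.db,
                              keys = gene_ids,
                              columns = c("SYMBOL", "ENSEMBLPROT"),
                              keytype = "ENTREZID")
head(prot)
```

A gene with several protein products yields several rows, so expect a 1:many result.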

For more general annotations, the biomaRt package is very handy. The idea is to discover the 'mart', 'dataset', 'filters' and 'attributes' available, via listMarts() etc., and then to compose a query:

library(biomaRt)
## listMarts(), listDatasets(useMart("ensembl")), etc
mart <- useMart("ensembl", "hsapiens_gene_ensembl")
filters <- "ensembl_transcript_id"      # info I'll provide, see listFilters(mart)
attr <-                                 # info I want, ?listAttributes
    c("ensembl_gene_id", "ensembl_transcript_id", "ensembl_peptide_id") 
values = c("ENST00000275493", "ENST00000344576") # info I have

and then the query

getBM(attr, filters, values, mart)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493

An alternative to the final line, consistent with the use of select in other annotation resources, is

select(mart, values, attr, filters)
  ensembl_gene_id ensembl_transcript_id ensembl_peptide_id
1 ENSG00000146648       ENST00000344576    ENSP00000345973
2 ENSG00000146648       ENST00000275493    ENSP00000275493
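Since the original question asks about transcripts that are actually translated, it may help to drop rows without a protein id. The sketch below works on a toy data frame shaped like the result above. The third row is hypothetical, and it assumes that a missing peptide id is reported as an empty string or NA, which is worth checking against your own result.

```r
# Toy result shaped like the getBM() output above.
# The third row is HYPOTHETICAL: a transcript with no protein product.
res <- data.frame(
    ensembl_gene_id       = c("ENSG00000146648", "ENSG00000146648", "ENSG_HYPOTHETICAL"),
    ensembl_transcript_id = c("ENST00000344576", "ENST00000275493", "ENST_HYPOTHETICAL"),
    ensembl_peptide_id    = c("ENSP00000345973", "ENSP00000275493", ""),
    stringsAsFactors = FALSE
)

# Keep rows whose peptide id is neither NA nor an empty string
has_protein <- !is.na(res$ensembl_peptide_id) & nzchar(res$ensembl_peptide_id)
translated <- res[has_protein, ]

# Named vector for quick transcript -> protein lookup
tx2prot <- setNames(translated$ensembl_peptide_id,
                    translated$ensembl_transcript_id)
print(tx2prot)
```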

In truth I 'discovered' the relevant marts, data sets, etc., partly in R and partly by navigating the ensembl mart. Don't forget to check out the biomaRt vignette.
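To start from UCSC transcript ids rather than Ensembl ones, something along the following lines might work. It is a sketch under assumptions, not a tested recipe: it assumes the GRCh37 archive mart is reachable through useEnsembl(..., GRCh = 37), that the dataset offers a filter and attribute named "ucsc", and that the TXNAME values of the hg19 knownGene package match the UCSC ids stored there. ucsc_to_protein is a hypothetical helper name.

```r
# Hypothetical helper: map UCSC transcript ids to Ensembl transcript/protein ids.
# Assumes `mart` offers a filter and an attribute both called "ucsc".
ucsc_to_protein <- function(ucsc_ids, mart) {
    biomaRt::getBM(
        attributes = c("ucsc", "ensembl_transcript_id", "ensembl_peptide_id"),
        filters    = "ucsc",
        values     = ucsc_ids,
        mart       = mart
    )
}

# Needs network access and the annotation package, so only run interactively
if (interactive()) {
    library(biomaRt)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)

    # GRCh37 (hg19) archive of the Ensembl genes mart
    mart37 <- useEnsembl(biomart = "ensembl",
                         dataset = "hsapiens_gene_ensembl", GRCh = 37)

    # Verify the assumption before querying
    stopifnot("ucsc" %in% listFilters(mart37)$name)

    # UCSC transcript names from the knownGene track
    ucsc_ids <- head(keys(TxDb.Hsapiens.UCSC.hg19.knownGene, "TXNAME"))
    print(ucsc_to_protein(ucsc_ids, mart37))
}
```

Using the GRCh37 archive keeps the Ensembl annotation on the same genome build as the hg19 knownGene track.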


Hi Martin, thank you so much for your answer. I was trying to link the transcripts to a protein product. For instance, on the Ensembl interface for a given gene (e.g. EGFR), you can see several transcripts (e.g. ENST00000275493 or ENST00000344576), and each transcript has an associated protein (e.g. ENSP00000275493 or ENSP00000345973). I was trying to get this kind of information using Bioconductor, starting from UCSC transcript IDs.


I've updated my answer with how one can use biomaRt (the Bioconductor package) to query biomart (the online resource).


Thanks a lot! This is really helpful!

10.5 years ago
Kizuna ▴ 880

If you do not have many transcript IDs, you can use BioMart (Ensembl): http://www.ensembl.org/biomart/martview/b3b87cd3b220cf9d6d08d7de1a51fadd

You can also find easy tutorials for this tool :)

Hope it helps.


Hi Kizuna, thanks a lot! The thing is, I do have quite a few, so I would like to automate this. This is why I was interested in Bioconductor.
