Question

Ensembl: Protein coding transcript ids

0

10.7 years ago

bsmith030465 ▴ 240

Hi,

I want to use only protein-coding transcripts from Ensembl (via the biomaRt package from Bioconductor). Is there a way to tell these apart from the transcript IDs alone? Some example data that I have:

gene_symbol      ensembl_id ensembl_transcript_id transcript_start transcript_end ensembl_exon_id exon_chrom_start exon_chrom_end strand chromosome_name
1      OR4F29 ENSG00000235249       ENST00000426406           367640         368634 ENSE00002316283           367640         368634      1               1
2      OR4F16 ENSG00000185097       ENST00000332831           621059         622053 ENSE00002324228           621059         622053     -1               1
3      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00001637883           860260         860328      1               1
4      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00001763717           861302         861393      1               1
5      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00002727207           865535         865716      1               1

=========

Or, how can I change my query so that only protein-coding transcripts are returned? My query is:

gb <- getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name"),
                filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl)

thanks!

id biomart ensembl transcript protein coding • 6.4k views

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by bsmith030465 ▴ 240

Ram · Answer 1 · 2014-10-29

4

10.7 years ago

komal.rathi ★ 4.1k

Add the attribute transcript_biotype to your query, save to a data.frame and filter:

res = getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name","gene_biotype","transcript_biotype"), filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl)

# only keep transcripts that are protein coding
res = res[which(res$transcript_biotype=="protein_coding"),]
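If you would rather have the server return only protein-coding transcripts, one possible variant is to add a second filter to the query instead of subsetting afterwards. This is a sketch, not a tested recipe: it assumes your mart exposes a filter named "transcript_biotype" (check with listFilters(mart)), and the wrapper name get_coding_transcripts is made up for illustration.

```r
# Sketch only: assumes the mart has a filter called "transcript_biotype".
# Confirm with biomaRt::listFilters(mart) before relying on it.
# get_coding_transcripts is a hypothetical helper name, not part of biomaRt.
get_coding_transcripts <- function(gene_ids, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id", "transcript_biotype"),
    filters    = c("ensembl_gene_id", "transcript_biotype"),
    # With more than one filter, values must be a list in the same order as filters.
    values     = list(gene_ids, "protein_coding"),
    mart       = mart
  )
}
```

With the objects from the question, the call would look like get_coding_transcripts(ensembl_id, ensembl). Filtering in the query keeps the download smaller, while filtering the data.frame afterwards lets you still inspect the other biotypes.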

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by komal.rathi ★ 4.1k

0

Thanks!

I probably haven't explained myself clearly! A protein-coding gene may have several transcripts associated with it. Some of these may be protein coding ("protein_coding"), while other transcripts of the same gene may be non-coding ("retained_intron" or something else).

So, for each transcript how can I retrieve whether it is classified as 'protein coding' or something else?

And apologies for an inadequate problem formulation!

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by bsmith030465 ▴ 240

0

bsmith030465 I have edited my answer. Please check. Use the transcript_biotype attribute and filter out any rows that are not protein_coding.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by komal.rathi ★ 4.1k

Ram · Answer 2 · 2014-10-30

1

10.7 years ago

onuralp ▴ 190

No, you cannot deduce this information from the transcript ID alone.

You have two options:

If you use the BioMart web interface, you can select "transcript biotype" from the attributes to report.
If you use the biomaRt package, modify your query as follows:

Your query:

gb <- getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name"), filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl)

Modified query - notice the new attribute transcript_biotype:

gb <- getBM(attributes=c("ensembl_transcript_id","transcript_start","transcript_end","ensembl_exon_id","exon_chrom_start","exon_chrom_end","strand","chromosome_name","transcript_biotype"), filters = "ensembl_gene_id", values=ensembl_id, mart=ensembl)
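Because the query also returns exon attributes, the result has one row per exon and the biotype repeats on every exon row of a transcript. The sketch below shows one way to get a single classification per transcript and then keep only the protein-coding ones. It runs on a small mock data.frame rather than a live query; the IDs (TX_A, EX_1, ...) and biotype values are placeholders invented for illustration.

```r
# Mock of a getBM() result: one row per exon, so transcript columns repeat.
# All IDs and biotypes here are invented placeholders, not real Ensembl data.
gb <- data.frame(
  ensembl_transcript_id = c("TX_A", "TX_B", "TX_B"),
  ensembl_exon_id       = c("EX_1", "EX_2", "EX_3"),
  transcript_biotype    = c("protein_coding", "retained_intron", "retained_intron"),
  stringsAsFactors      = FALSE
)

# Collapse to one row per transcript to see how each transcript is classified.
tx_biotype <- unique(gb[, c("ensembl_transcript_id", "transcript_biotype")])

# Keep the exon rows that belong to protein-coding transcripts only.
is_coding  <- tx_biotype$transcript_biotype == "protein_coding"
coding_ids <- tx_biotype$ensembl_transcript_id[is_coding]
gb_coding  <- gb[gb$ensembl_transcript_id %in% coding_ids, ]
```

Running table(tx_biotype$transcript_biotype) on the collapsed table gives a quick count of transcripts per biotype.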

What do the different biotypes in Ensembl mean?

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by onuralp ▴ 190

0

onuralp I think you did not read the other answers but I mentioned that in my answer already. Look above.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by komal.rathi ★ 4.1k

0

Hey Komal! Sorry, I have not seen your answer. I use an RSS reader to sift through the questions. I upvoted your comment.

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by onuralp ▴ 190

0

No worries. I like to reinforce the 'no redundancy' policy whenever and wherever possible :)

ADD REPLY • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by komal.rathi ★ 4.1k

Ram · Answer 3 · 2014-10-29

0

10.7 years ago

Manvendra Singh ★ 2.2k

Add the Ensembl protein ID to your attributes as well.

Each transcript has either zero or one Ensembl protein ID, so every row with an ENSPXXXXXXXXXXX value belongs to a protein-coding transcript.
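A minimal sketch of that check, run on a mock data.frame instead of a live query. It assumes the protein ID attribute is named "ensembl_peptide_id" and that transcripts without a protein come back as an empty string or NA; the IDs are invented placeholders.

```r
# Mock result; TX_* and PEP_* are placeholder IDs, not real Ensembl data.
# Assumes the attribute is "ensembl_peptide_id" and that a missing protein
# is returned as "" or NA.
res <- data.frame(
  ensembl_transcript_id = c("TX_A", "TX_B", "TX_C"),
  ensembl_peptide_id    = c("PEP_A", "", NA),
  stringsAsFactors      = FALSE
)

# TRUE only where a non-missing, non-empty protein ID is present.
has_protein <- !is.na(res$ensembl_peptide_id) & nzchar(res$ensembl_peptide_id)

# Keep the rows whose transcript has a protein ID.
res_with_protein <- res[has_protein, ]
```

A protein ID indicates an annotated translation, which may not coincide exactly with transcript_biotype == "protein_coding" for every transcript, so it is worth comparing the two on your own data before choosing one.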

HTH

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.7 years ago by Manvendra Singh ★ 2.2k