Hi,
I want to use only protein-coding transcripts from Ensembl (via biomaRt in Bioconductor). Is there a way to tell these apart using the transcript IDs? Some example data that I have:
  gene_symbol      ensembl_id ensembl_transcript_id transcript_start transcript_end ensembl_exon_id exon_chrom_start exon_chrom_end strand chromosome_name
1      OR4F29 ENSG00000235249       ENST00000426406           367640         368634 ENSE00002316283           367640         368634      1               1
2      OR4F16 ENSG00000185097       ENST00000332831           621059         622053 ENSE00002324228           621059         622053     -1               1
3      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00001637883           860260         860328      1               1
4      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00001763717           861302         861393      1               1
5      SAMD11 ENSG00000187634       ENST00000420190           860260         874671 ENSE00002727207           865535         865716      1               1
=========
Or, how can I change my query so that only protein-coding transcripts are returned? My query is:
gb <- getBM(
  attributes = c("ensembl_transcript_id", "transcript_start", "transcript_end",
                 "ensembl_exon_id", "exon_chrom_start", "exon_chrom_end",
                 "strand", "chromosome_name"),
  filters = "ensembl_gene_id",
  values = ensembl_id,
  mart = ensembl
)
thanks!
Thanks!
I probably haven't explained myself clearly! A protein-coding gene may have several transcripts associated with it. Some of these may be protein coding ("protein coding"), while other transcripts of the same gene may be non-coding ("retained intron" or something else).
So, for each transcript how can I retrieve whether it is classified as 'protein coding' or something else?
And apologies for an inadequate problem formulation!
bsmith030465 I have edited my answer. Please check. Use the transcript_biotype attribute and filter out any that are not protein_coding.
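The suggestion above could be sketched roughly as follows. This is only an illustration, not necessarily the exact code from the edited answer. It assumes the biotypes are fetched in a separate getBM() call and then used to subset the exon-level table gb from the question, which avoids assuming that transcript_biotype can be combined with the exon attributes in a single query. The helper names keep_protein_coding and fetch_transcript_biotypes are made up for this example, and the toy IDs (TX_A, EX_1, ...) are placeholders, not real Ensembl records.

```r
# Keep only the exon rows whose transcript is annotated as protein_coding.
#   gb:       exon-level data frame, e.g. the result of the getBM() query
#             in the question (must contain ensembl_transcript_id)
#   biotypes: data frame with ensembl_transcript_id and transcript_biotype
keep_protein_coding <- function(gb, biotypes) {
  # %in% treats NA biotypes as non-matching instead of propagating NA
  coding_ids <- biotypes$ensembl_transcript_id[
    biotypes$transcript_biotype %in% "protein_coding"
  ]
  gb[gb$ensembl_transcript_id %in% coding_ids, , drop = FALSE]
}

# Hypothetical wrapper around getBM(); defined but not called here because
# it needs the biomaRt package, a Mart object, and network access.
fetch_transcript_biotypes <- function(ensembl_id, mart) {
  biomaRt::getBM(
    attributes = c("ensembl_transcript_id", "transcript_biotype"),
    filters    = "ensembl_gene_id",
    values     = ensembl_id,
    mart       = mart
  )
}

# Toy demonstration with placeholder IDs (NOT real Ensembl annotations).
gb_toy <- data.frame(
  ensembl_transcript_id = c("TX_A", "TX_A", "TX_B"),
  ensembl_exon_id       = c("EX_1", "EX_2", "EX_3"),
  stringsAsFactors      = FALSE
)
biotypes_toy <- data.frame(
  ensembl_transcript_id = c("TX_A", "TX_B"),
  transcript_biotype    = c("protein_coding", "retained_intron"),
  stringsAsFactors      = FALSE
)
coding_toy <- keep_protein_coding(gb_toy, biotypes_toy)
```

With the objects from the question, the call would look like gb_coding <- keep_protein_coding(gb, fetch_transcript_biotypes(ensembl_id, ensembl)). Running table() on the transcript_biotype column first is a quick way to see which biotypes are actually present before discarding anything.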