mapping gene type or biotype to ENSEMBL ID
3.9 years ago
basuanubhav ▴ 140

Hi all,

I have a list of Ensembl IDs of human lncRNAs for which I am trying to figure out the gene type (or biotype), e.g. lincRNA, processed transcript, antisense, sense_overlapping, etc. I have a GTF file (GENCODE v30) which contains the gene_id and gene_type attributes in the 9th column. I could write a script to map my IDs using this GTF file, but I was wondering whether there is an easier way to do it using online tools. I tried biomaRt, but the current Ensembl release collapses the various types of lncRNA into a single type, i.e. lncRNA. I really want to get the subtype for each lncRNA.

P.S. The last GENCODE version with the lncRNA types 'split up' is v30.

Thanks in advance :)

Annotation ensembl Org.Hs.eg.db biomaRt • 1.9k views

Since you have not provided any example IDs I can't check, but I suggest taking a look at RNAcentral.


Thanks a lot!! I'll check it out :)

3.9 years ago

I'll use the latest GENCODE GTF for human (release 36) as an example.

curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz | \
gunzip > human_gencode_36.gtf

You can import the GTF file into R using rtracklayer::import and keep only the columns you need.

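library("tidyverse")
library("rtracklayer")

# Import every GTF record, then keep one row per gene with the
# three columns of interest (same order as the printed output below).
gtf <- import("human_gencode_36.gtf") %>%
  as_tibble() %>%
  distinct(gene_id, gene_type, gene_name)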

> gtf
# A tibble: 60,660 x 3
   gene_id           gene_type                          gene_name  
   <chr>             <chr>                              <chr>      
 1 ENSG00000223972.5 transcribed_unprocessed_pseudogene DDX11L1    
 2 ENSG00000227232.5 unprocessed_pseudogene             WASH7P     
 3 ENSG00000278267.1 miRNA                              MIR6859-1  
 4 ENSG00000243485.5 lncRNA                             MIR1302-2HG
 5 ENSG00000284332.1 miRNA                              MIR1302-2  
 6 ENSG00000237613.2 lncRNA                             FAM138A    
 7 ENSG00000268020.3 unprocessed_pseudogene             OR4G4P     
 8 ENSG00000240361.2 transcribed_unprocessed_pseudogene OR4G11P    
 9 ENSG00000186092.6 protein_coding                     OR4F5      
10 ENSG00000238009.6 lncRNA                             AL627309.1 
# … with 60,650 more rows

Let's say that you have a vector of gene_ids that you want to get the information for.

genes <- sample(gtf$gene_id, 5)

> genes
[1] "ENSG00000287105.1"  "ENSG00000254060.1"  "ENSG00000271538.6" 
[4] "ENSG00000148399.13" "ENSG00000234648.1"

You can simply filter the imported data using this vector.

> filter(gtf, gene_id %in% genes)
# A tibble: 5 x 3
  gene_id            gene_type            gene_name 
  <chr>              <chr>                <chr>     
1 ENSG00000271538.6  lncRNA               LINC02427 
2 ENSG00000287105.1  lncRNA               AC090577.1
3 ENSG00000254060.1  lncRNA               AC022778.1
4 ENSG00000148399.13 protein_coding       DPH7      
5 ENSG00000234648.1  processed_pseudogene AL162151.2
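
As a command-line alternative to the R steps above, here is a minimal sketch using awk. It assumes a GENCODE-style GTF (tab-separated, attributes in column 9 written as key "value";) and a plain-text file with one Ensembl gene ID per line. The function names gtf_gene_types and lookup_gene_types are made up for this illustration, not part of any tool. Because GENCODE IDs carry a version suffix (e.g. ENSG00000223972.5) while ID lists often do not, the lookup compares IDs with the suffix stripped.

```shell
#!/usr/bin/env bash

# Read a GTF on stdin and print gene_id, gene_type and gene_name
# (tab-separated) for every record whose feature type is "gene".
gtf_gene_types() {
  awk -F'\t' -v OFS='\t' '
    # Return the quoted value of attribute "key", or NA if it is absent.
    function attr(s, key,    pat, v) {
      pat = key " \"[^\"]*\""
      if (match(s, pat)) {
        v = substr(s, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", v)   # drop the key and the opening quote
        sub(/"$/, "", v)        # drop the closing quote
        return v
      }
      return "NA"
    }
    !/^#/ && $3 == "gene" {
      print attr($9, "gene_id"), attr($9, "gene_type"), attr($9, "gene_name")
    }
  '
}

# Keep only the rows of the table on stdin whose gene_id is listed in the
# file given as $1. Version suffixes (.N) are ignored on both sides.
lookup_gene_types() {
  awk -F'\t' '
    NR == FNR { id = $1; sub(/\.[0-9]+$/, "", id); want[id] = 1; next }
    { id = $1; sub(/\.[0-9]+$/, "", id); if (id in want) print }
  ' "$1" -
}
```

With the functions defined, something like `gunzip -c gencode.v30.annotation.gtf.gz | gtf_gene_types | lookup_gene_types my_ids.txt` would print one line per matched gene (both file names are placeholders). Pointing it at the v30 GTF mentioned in the question rather than v36 should give the split-up lncRNA subtypes in the gene_type column.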

Ah, thanks a lot for the prompt and clear answer!! Actually, after posting the question I tried the rtracklayer::import function and saved the GTF as a data frame in which one column was the gene type. After that (since I'm not very confident with dplyr), I just ran a loop over my Ensembl IDs and used the 'match' function to get the corresponding gene type from the data frame. So I think we approached it the same way, but I am sure the data frame can be manipulated better using dplyr to get more fine-tuned information.

So, thanks again for the answer :)
