I need all the exons coordinates for a specific gene transcript (let's say, "ENST00000269305"). I try to do it with 2 different methods:
using getBM from biomaRt
library("biomaRt") human = useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl",host="grch37.ensembl.org", path="/biomart/martservice",ensemblRedirect = FALSE) transcript_info = data.frame(unique(getBM(attributes = c("chromosome_name", "genomic_coding_start","genomic_coding_end"), filters="ensembl_transcript_id", values="ENST00000269305",mart = human))) transcript_info[order(transcript_info$genomic_coding_start),]
chromosome_name genomic_coding_start genomic_coding_end 11 17 7572927 7573008 10 17 7573927 7574033 7 17 7576853 7576926 6 17 7577019 7577155 5 17 7577499 7577608 4 17 7578177 7578289 3 17 7578371 7578554 2 17 7579312 7579590 1 17 7579700 7579721 9 17 7579839 7579912 8 17 NA NA
my function, that uses GTF file (downloaded from Ensembl, filtered to have only protein-coding genes)
library(rtracklayer) library(GenomicRanges) gtf = import.gff("~/Documents/data/grch37/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.ProteinCodingGenes_filtered.gtf") gtf_exons <- gtf[gtf$type == "exon"] transcript_info = gtf_exons[gtf_exons$transcript_id == "ENST00000269305"] transcript_info = data.frame(transcript_info[,c()]) transcript_info = transcript_info[,c(1:3)] colnames(transcript_info) = c("chromosome_name", "genomic_coding_start","genomic_coding_end") transcript_info[order(transcript_info$genomic_coding_start),]
chromosome_name genomic_coding_start genomic_coding_end 11 17 7571720 7573008 10 17 7573927 7574033 9 17 7576853 7576926 8 17 7577019 7577155 7 17 7577499 7577608 6 17 7578177 7578289 5 17 7578371 7578554 4 17 7579312 7579590 3 17 7579700 7579721 2 17 7579839 7579940 1 17 7590695 7590856
Surprisingly, it gives me different results although some exons are shared. Manual checking with UCSC Genome Browser (hg19) supports the second script. What can be the reason that biomaRt fails to obtain some exons?
Thank you for the explanation! I believe, as I don't want to depend on external databases (I've been having a lot of trouble while looking for the coordinates of many genes via getBM due to ofter Ensembl mirrors shut-downs) I should use CDS + stop_codon mask from GTF file to get the same sets of coordinates as Ensembl provides.
You can also use the ensembldb package (https://bioconductor.org/packages/release/bioc/html/ensembldb.html) which will let you use a offline version of the Ensembl data (so no network issues) but in a database format you can construct all sorts of queries for.