I have a list of genes with Ensembl IDs like "ENSG00000272379.1", and I want to retrieve the corresponding chromosome number and the start and end positions of each gene on the chromosome. I have tried using Ensembl BioMart (http://asia.ensembl.org/biomart/martview/e96b1b88c4e0cbaf1b9d7442ed9f9b68), but it does not process all the genes in the text file. For example, I have 6000 genes and it outputs results for only 1200 of them. My input file looks like this
written 5.8 years ago by Star • updated 18 months ago by Ram
Thank you for the reply! I did it using R as well as the BioMart interface. I had a total of 6605 genes, but I got the required information for only 6414 of them. The data for 191 genes are missing (using GRCh38.p12). Is there any way to get the required information for all the genes?
Thank you for the code, Tiago211287. But at the end my data frame (my_ids.version) is returned empty. I had replaced the original input line with
test <- read.table("MetaXcanOutput-BiomartInput.txt")
my_ids <- data.frame(ensembl_gene_id_version=c(test$v1))
where MetaXcanOutput-BiomartInput.txt is the file containing almost 6000 gene IDs along with the version number. Moreover, the data frame object "my_ids" contains two identical columns, as follows.
I am pretty new to data science and R. But as I understand it, the last line of code,
my_ids.version <- merge(my_ids, t2g, by= 'ensembl_gene_id')
cannot find a matching ensembl_gene_id, which is why my_ids.version is empty. Am I right? Can you suggest anything further?
You need one column with your regular versioned gene IDs, called ensembl_gene_id_version, and another column, ensembl_gene_id, holding the same IDs with the version suffix removed using gsub.
Only then you can merge, using:
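The exact command is not shown in the thread. As a rough sketch of the two-column idea described above, assuming t2g is an annotation table keyed by unversioned ensembl_gene_id (the toy t2g, the second gene ID and all coordinates below are made-up placeholders, not real BioMart output):

```r
# Versioned IDs standing in for the contents of the real input file.
my_ids <- data.frame(
  ensembl_gene_id_version = c("ENSG00000272379.1", "ENSG00000000003.14"),
  stringsAsFactors = FALSE
)

# Strip the trailing ".<version>" to obtain the stable, unversioned gene ID.
my_ids$ensembl_gene_id <- gsub("\\.[0-9]+$", "", my_ids$ensembl_gene_id_version)

# Toy stand-in for the annotation table (t2g); coordinates are invented.
t2g <- data.frame(
  ensembl_gene_id = c("ENSG00000272379", "ENSG00000000003"),
  chromosome_name = c("1", "X"),
  start_position  = c(100L, 200L),
  end_position    = c(150L, 260L),
  stringsAsFactors = FALSE
)

# Merge on the unversioned ID; the versioned column is carried along.
my_ids.version <- merge(my_ids, t2g, by = "ensembl_gene_id")
```

If the merge comes back empty, comparing head(my_ids$ensembl_gene_id) with head(t2g$ensembl_gene_id) usually shows whether the version suffix is still attached on one side.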
It's done! Thank you so much. But there is still a problem. The input file has 6605 gene IDs, but I get results for only 6414 genes (I tried the same with the BioMart online tool as well). 191 genes are missing, which suggests that those genes are not present in the database. What could be a possible solution? How can I get the information for the remaining genes?
Can you please share a sample of the remaining IDs?
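One possible way to list the IDs that came back without a result is a set difference between what was submitted and what was returned. This is only a sketch: the vectors submitted and returned and their contents are hypothetical placeholders for the real input column and the ensembl_gene_id column of the merged result.

```r
# Hypothetical inputs: IDs that were submitted vs. IDs that were returned.
submitted <- c("ENSG00000272379", "ENSG00000000003", "ENSG00000000005")
returned  <- c("ENSG00000272379", "ENSG00000000003")

# IDs present in the input but absent from the results.
missing_ids <- setdiff(submitted, returned)

# Save them so they can be pasted into another BioMart query or shared.
writeLines(missing_ids, file.path(tempdir(), "missing_ids.txt"))
```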
Hi, a few of the genes missing from the results are listed below. When I searched for these genes in the GRCh37 build, they mapped to the respective genes, but no results were available for GRCh38.p12.
Is there a way to extract GRCh37 using biomart in R?
You can try to convert your retired IDs using the ID Conversion tool for GRCh37. Or maybe this Biostar link can help you to access the GRCh37 using biomaRt.
This Ensembl page contains information about converting between both assemblies.
Hello aammarah.632
All of the above suggestions are great - BioMart will work with versioned IDs but you need to select the correct format, and if the IDs do not exist in the current database (regardless of the version) then no results will be found.
I took a look at a couple of your IDs and can see that they are not in the dedicated GRCh37 site, which is the database we continue to update with new data. However, I could find them in the archive site for GRCh37 from 2014, which is not updated and so remains a snapshot of the data from 2014. This suggests to me that these genes are no longer in the current database, probably because the annotation has been reviewed and they were found to no longer be correct as new data (e.g. cDNA, protein, or EST) has become available.
You could pass your list of lost IDs through the archive's BioMart either on the website or through the R package - you can see how to do the latter here - you need to specify release 75: How To Use Archived Version Of Ensembl In Biomart. If you want to, you can extract the coordinates and map them to GRCh38 using our Assembly converter.
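For the R route, a minimal sketch of querying the release 75 archive with biomaRt might look like the following. The function name fetch_archived_coords is made up for illustration, and the sketch assumes that useEnsembl() can still resolve version = 75 to a live archive host, which may not hold indefinitely. Unversioned IDs are used because the versioned-ID filter may not be available in an archive this old.

```r
# Hypothetical helper: look up gene coordinates in an archived Ensembl release.
fetch_archived_coords <- function(gene_ids, ensembl_version = 75) {
  if (!requireNamespace("biomaRt", quietly = TRUE)) {
    stop("The biomaRt package is required for this sketch.")
  }
  # Connect to the archived human gene dataset (assumes the archive is online).
  mart <- biomaRt::useEnsembl(
    biomart = "ensembl",
    dataset = "hsapiens_gene_ensembl",
    version = ensembl_version
  )
  # Ask for chromosome, start and end for each unversioned gene ID.
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "chromosome_name",
                   "start_position", "end_position"),
    filters = "ensembl_gene_id",
    values  = gene_ids,
    mart    = mart
  )
}

# Usage (needs network access; the ID below is only a placeholder):
# lost_ids <- gsub("\\.[0-9]+$", "", c("ENSG00000272379.1"))
# coords   <- fetch_archived_coords(lost_ids)
```

The coordinates returned this way are on GRCh37, so they would still need to go through an assembly conversion step if GRCh38 positions are required.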
Hmm, right. Thank you. I have extracted the positions of all the genes from GRCh37. Thanks for the help, tiago211287 and Erin_Ensembl.