19 months ago
Barista
I have an Excel file that contains the columns chrom, pos, id, ref and alt. I want to add a new column holding the gene name for each row.
For that I am using the getBM() function in biomaRt, but it takes too long to finish. I realize it may be slow because my dataset contains 500,000 rows, but it has now been running for over an hour and still has not finished.
This is how I do it:
options(max.print=1000000)
library(readxl)
library(dplyr)
vcf_data <- read_excel("/Users/.../rows.xlsx", col_names = TRUE)
vcf_data <- dplyr::rename(vcf_data, chrom = chrom, pos = pos, id = id, ref = ref, alt = alt)
vcf_data <- dplyr::select(vcf_data, chrom, pos, id, ref, alt)
vcf_data <- vcf_data[!grepl("^ns", vcf_data$id), ]
library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
gene_names <- getBM(attributes = c("hgnc_symbol"),
                    filters = c("chromosome_name", "start", "end"),
                    values = list(vcf_data$chrom, vcf_data$pos, vcf_data$pos),
                    mart = mart)
merged_data <- merge(vcf_data, gene_names,
                     by.x = c("chrom", "pos"),
                     by.y = c("Chromosome", "Start"))
write.xlsx(merged_data, "/Users/.../fileWithGeneNames.xlsx", row.names = FALSE)
Is there a better way to do it? This is my first time using biomaRt, so I might have done something wrong.
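One possible direction, offered as a rough sketch rather than a tested fix: instead of sending 500,000 positions to the server as filter values, download the gene coordinates once and do the position-to-gene matching locally. The example below uses a small made-up gene table in place of the real download, and `annotate_genes` is a hypothetical helper name, not part of biomaRt. It assumes chromosome names in the spreadsheet match Ensembl's style ("1", not "chr1") and that both tables use the same genome assembly.

```r
# Toy stand-in for the table that one getBM() call without filters could return:
#   genes <- getBM(attributes = c("chromosome_name", "start_position",
#                                 "end_position", "hgnc_symbol"),
#                  mart = mart)
# The values below are invented for illustration only.
genes <- data.frame(
  chromosome_name = c("1", "1", "2"),
  start_position  = c(100, 150, 500),
  end_position    = c(200, 300, 900),
  hgnc_symbol     = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)

# Toy variants with the same columns as the spreadsheet in the question.
variants <- data.frame(
  chrom = c("1", "1", "2", "3"),
  pos   = c(120, 180, 50, 10),
  id    = c("v1", "v2", "v3", "v4"),
  ref   = c("A", "C", "G", "T"),
  alt   = c("G", "T", "A", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each variant, collect the symbols of all genes
# on the same chromosome whose start..end span contains the position.
annotate_genes <- function(variants, genes) {
  genes_by_chrom <- split(genes, genes$chromosome_name)
  variants$gene <- vapply(seq_len(nrow(variants)), function(i) {
    g <- genes_by_chrom[[as.character(variants$chrom[i])]]
    if (is.null(g)) return(NA_character_)  # chromosome absent from gene table
    hit <- g$hgnc_symbol[g$start_position <= variants$pos[i] &
                         g$end_position >= variants$pos[i]]
    # Overlapping genes are joined with ";"; no overlap gives NA.
    if (length(hit) == 0) NA_character_ else paste(hit, collapse = ";")
  }, character(1))
  variants
}

annotated <- annotate_genes(variants, genes)
print(annotated)
```

The loop is plain base R to keep the sketch dependency-free. For a table of this size, an interval-overlap routine such as `findOverlaps()` from the Bioconductor package GenomicRanges is a common alternative and would likely be faster.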
Thank you so much! I will try this out right now. In case of any further problems doing this using VEP, can I add a new comment here and count on your help? I would appreciate this a lot!! :)
https://www.ensembl.org/info/about/contact/index.html