I've been using this to obtain the gene name and Ensembl ID:
import requests
from pyensembl import EnsemblRelease

data = EnsemblRelease(75)  # pick the release matching your genome build

def get_gene_name(row, CHR, pos):
    genes_list = data.gene_names_at_locus(contig=row[CHR], position=row[pos])
    return list(filter(None, genes_list))

def get_ensembl_id(chromosome, position):
    ensembl_url = f"https://rest.ensembl.org/overlap/region/human/{chromosome}:{position}-{position}?feature=gene"
    headers = {"Content-Type": "application/json"}
    response = requests.get(ensembl_url, headers=headers)
    if response.status_code != 200:
        raise Exception(f"Request failed: {response.status_code}")
    payload = response.json()
    genes = [entry['gene_id'] for entry in payload if entry['feature_type'] == 'gene']
    return genes
However, I have 900k SNPs, so this isn't feasible. So far I've been looping over all 900k SNPs using:
df.apply(get_gene_name, args=('chr_name', 'chr_position'), axis=1)
It's taking way too long. Is there a faster way to obtain this info?
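If you want to stay in Python, the main speedup is to stop making one lookup (or REST call) per SNP and instead build an in-memory interval index over the gene coordinates once, then query it 900k times. A minimal stdlib sketch, assuming you already have gene coordinates as (chrom, start, end, name) tuples in BED-style 0-based, half-open coordinates:

```python
from bisect import bisect_right
from collections import defaultdict

def build_gene_index(genes):
    """genes: iterable of (chrom, start, end, name) tuples,
    0-based half-open as in BED. Returns a per-chromosome index."""
    by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        by_chrom[chrom].append((start, end, name))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        starts = [iv[0] for iv in ivs]
        max_len = max(end - start for start, end, _ in ivs)
        index[chrom] = (starts, ivs, max_len)
    return index

def genes_at(index, chrom, pos):
    """Names of all genes overlapping a 0-based position."""
    if chrom not in index:
        return []
    starts, ivs, max_len = index[chrom]
    hi = bisect_right(starts, pos)             # genes starting at or before pos
    lo = bisect_right(starts, pos - max_len)   # genes that could still reach pos
    return [name for start, end, name in ivs[lo:hi] if start <= pos < end]
```

Build the index once, then replace the per-row network/database call with `df.apply(lambda r: genes_at(index, r['chr_name'], r['chr_position']), axis=1)`; mind the 1-based vs 0-based coordinate convention of your SNP positions.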
Where would I get the BED file for the genes from? And what would that file look like? Could you provide a minimal example? Thanks!
How To Download The Entire Human Gene List With Their Chromosomal Location
The columns are chrom/start/end/name: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
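For the SNP side of the intersection, here is a minimal sketch that converts a DataFrame of SNPs into a BED file (the column names `chr_name`/`chr_position` are assumed from the question, and positions are assumed to be 1-based; BED is 0-based and half-open, so a single base at 1-based position p becomes the interval [p-1, p)):

```python
import pandas as pd

# Hypothetical input; replace with your real SNP table.
df = pd.DataFrame({
    "chr_name": ["1", "1", "2"],
    "chr_position": [11873, 14409, 31000],  # 1-based positions
})

# Convert 1-based positions to 0-based half-open BED intervals.
bed = pd.DataFrame({
    "chrom": df["chr_name"],
    "start": df["chr_position"] - 1,
    "end": df["chr_position"],
})
bed.to_csv("snps.bed", sep="\t", header=False, index=False)
```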
This filters the SNPs in GENE1. Output is:
But how do I obtain a file which contains all the SNPs along with the gene they're in? Thanks again for your help!
EDIT: Never mind, removing -u did the job :)
Hmm... it should be the '-loj' option.
Thanks! How should the files be sorted? Numerically? Alphabetically?
I'm sorting them this way; is this correct?
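bedtools expects both BED files to be sorted the same way: chromosome lexicographically and start position numerically, i.e. the equivalent of `sort -k1,1 -k2,2n`. A small Python sketch of that sort (assumes tab-separated BED lines):

```python
def sort_bed(in_path, out_path):
    """Sort a BED file by chromosome (as a string, lexicographic)
    then by start position (as an integer, numeric)."""
    with open(in_path) as f:
        rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    rows.sort(key=lambda r: (r[0], int(r[1])))
    with open(out_path, "w") as f:
        for r in rows:
            f.write("\t".join(r) + "\n")
```

Note the lexicographic chromosome order: chr10 sorts before chr2. That looks odd, but it is consistent, which is all bedtools needs.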
Depending on your variants, it's possible to download dbSNP or gnomAD data (and a lot of other data), but your biggest challenge isn't there. I no longer remember everything, but I think the file sizes were on the order of one or two hundred GB, just for the SNP annotations.
But be warned! That approach is only correct if you just want to annotate "clean" (parsed) data. In addition, it's possible to optimize your code with Python's multithreading/multiprocessing libraries to reach a reasonable execution time (the same is possible in C and other languages).
Finally, if you want annotation that takes the VCF file's information into account, VEP is the most widely used bioinformatics tool.
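A minimal sketch of the multiprocessing idea with the standard library, assuming `annotate` is your own pure function mapping one (chrom, pos) tuple to a list of gene names (the placeholder body here is hypothetical):

```python
from multiprocessing import Pool

def annotate(snp):
    chrom, pos = snp
    # Placeholder: look up the genes for (chrom, pos) here,
    # e.g. against a local gene-coordinate index.
    return []

def annotate_all(snps, processes=4, chunksize=10_000):
    # With ~900k SNPs, a large chunksize matters: it amortizes the
    # inter-process communication cost over many items per task.
    with Pool(processes) as pool:
        return pool.map(annotate, snps, chunksize=chunksize)
```

Note this only pays off if `annotate` does real CPU work against local data; parallelizing one REST request per SNP would still hammer the Ensembl server and likely get you rate-limited.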
PS: During my MSc, I developed a Python script using the multiprocessing library, with a few algorithmic parts. I was able to annotate, extract, or modify SNPs quickly in GWAS files in PED and MAP format on a MacBook Air (under a minute). It was also a small cohort of about 550 samples, but with about 900,000 SNPs.
PS2: I know pandas has made progress in supporting large files, but I don't think it's a good long-term fit for your project. I think you will also need to manage your objects and memory usage yourself. For the script described in the previous PS, for example, I updated values in place at each referenced address rather than allocating new objects, to optimize memory usage.
PS3: My opinion assumes you only have access to one computer; otherwise, the VEP tool should work.