Hey everyone, I want to build a local database that I can query with Python to convert rsIDs into chromosomal locations. I don't need the whole dbSNP record, only the location.
My first attempt was to download this file (https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz) and filter it down to the columns `ID`, `#CHROM`, `POS`, `REF` and `ALT` (using `gzcat 00-All.vcf.gz | grep -v "##" | awk -v FS='\t' -v OFS='\t' '{print $3,$1,$2,$4,$5}' | gzip > 00-All_relevant.vcf.gz`).
I wanted to load that with pandas, do a little wrangling, and then save the result as a pickle file that I can load whenever I need it.
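This is roughly what I had in mind (just a sketch of the idea; the column names come from the awk output above, and the output file name is made up):

```python
import pandas as pd

# Read the pre-filtered, gzipped table produced by the awk pipeline above.
# Columns are in the order awk prints them: ID, #CHROM, POS, REF, ALT;
# header=0 skips the original "#CHROM" header row and uses these names instead.
df = pd.read_csv(
    "00-All_relevant.vcf.gz",
    sep="\t",
    compression="gzip",
    header=0,
    names=["ID", "CHROM", "POS", "REF", "ALT"],
    dtype={"ID": str, "CHROM": str, "POS": "int64", "REF": str, "ALT": str},
)

# Index by rsID so lookups are fast, then persist for later sessions.
df = df.set_index("ID")
df.to_pickle("rsid_to_location.pkl")  # file name is just an example
```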
However, this turns out to be quite a memory-intensive task, and I haven't even been able to finish it because of memory issues. I also worry that it could become a bottleneck for my tool. All I need is to get the genomic location of a variant after entering its rsID. Can any of you think of a more efficient way that doesn't require an online query?
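For reference, the kind of interface I'm after would be something like this (purely hypothetical names, assuming the pickled table from above and one row per rsID):

```python
import pandas as pd

# Load the pickled table once, then look up locations by rsID.
df = pd.read_pickle("rsid_to_location.pkl")

def get_location(rsid: str) -> tuple:
    """Return (chromosome, position) for a given rsID, e.g. 'rs334'."""
    row = df.loc[rsid]  # assumes the rsID appears in exactly one row
    return row["CHROM"], row["POS"]

print(get_location("rs334"))
```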