Generating a Python-parsable local db from dbSNP
3 months ago
gernophil ▴ 120

Hey everyone, I want to make a local database, parsable with Python, that converts rsIDs into chromosomal locations. So I don't need the whole of dbSNP, only the locations.

My first attempt was to download this file (https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz) and filter it down to the columns ID, #CHROM, POS, REF and ALT using gzcat 00-All.vcf.gz | grep -v "##" | awk -v FS='\t' -v OFS='\t' '{print $3,$1,$2,$4,$5}' | gzip > 00-All_relevant.vcf.gz. I wanted to load this with pandas, do a little wrangling, and then save it as a pickle file to load whenever I need it.

However, this turns out to be quite a memory-intensive task, and I haven't even been able to finish it due to memory issues. I worry that it might become a bottleneck for my tool. All I need is the genomic location of a variant given its rsID. Can any of you think of a more efficient way that doesn't require an online query?
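(Editor's note: as a sketch of the pandas step described above, reading the filtered table in chunks and keeping only the rsID, chromosome and position avoids loading the whole file at once. The column names, chunk size and the optional rsids filter are assumptions, not from the post; a dict over all of dbSNP will still be very large, so restricting to the rsIDs the tool actually needs is what keeps memory bounded.)

```python
import pandas as pd

def build_lookup(path, rsids=None):
    """Stream a filtered dbSNP TSV (columns ID, CHROM, POS, REF, ALT)
    in chunks and build an rsID -> (chrom, pos) dict.

    If rsids is given, only those IDs are kept, which bounds memory.
    """
    lookup = {}
    cols = ["ID", "CHROM", "POS", "REF", "ALT"]
    for chunk in pd.read_csv(path, sep="\t", names=cols, comment="#",
                             dtype={"CHROM": str, "POS": "int64"},
                             chunksize=1_000_000):
        if rsids is not None:
            chunk = chunk[chunk["ID"].isin(rsids)]
        for row in chunk.itertuples(index=False):
            lookup[row.ID] = (row.CHROM, row.POS)
    return lookup
```

The resulting dict can then be pickled once and reloaded cheaply at tool startup.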

Python • 310 views
3 months ago
raphael.B ▴ 520

Pandas is not really suited to very large tables. I am not familiar with it, but there seems to be a Python API for dbSNP, which could be of interest to you. If you prefer to use the file you generated, a simple loop would be the easiest (and quickest) approach. Something like the following.

with open("id2pos.tsv") as file:
    for l in file:  # iterate lazily; readlines() would load the whole file
        if l.split("\t")[0] == myid:  # rsID is the first column
            print(l)
            break
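(Editor's note: since the file generated in the question is gzipped, the same linear scan can stream it directly without decompressing to disk. This is a sketch, assuming the rsID is the first column of the filtered file; the function name is illustrative.)

```python
import gzip

def find_variant(path, myid):
    """Scan a gzipped TSV line by line and return the first line whose
    first column equals myid, or None if the rsID is not present."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            # split only once: we just need the first field
            if line.split("\t", 1)[0] == myid:
                return line.rstrip("\n")
    return None
```

For repeated lookups this O(n) scan is slow; for a single query per run it avoids holding anything in memory.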
ADD COMMENT
