Hey everyone, I want to build a local database that I can query with Python to convert rsIDs into chromosomal locations. I don't need the whole dbSNP record, only the location.
My first attempt was to download this file (https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz) and filter it down to the columns `ID`, `#CHROM`, `POS`, `REF` and `ALT` (using `gzcat 00-All.vcf.gz | grep -v "##" | awk -v FS='\t' -v OFS='\t' '{print $3,$1,$2,$4,$5}' | gzip > 00-All_relevant.vcf.gz`).
I wanted to load that with pandas, do a little wrangling, and then save the result as a pickle file that I can load whenever I need it.
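This is roughly what I had in mind (just a sketch of the idea; the column names come from the awk output above, and the output file name is made up):

```python
import pandas as pd

# Read the pre-filtered, gzipped table produced by the awk pipeline above.
# Columns are in the order awk prints them: ID, #CHROM, POS, REF, ALT;
# header=0 skips the original "#CHROM" header row and uses these names instead.
df = pd.read_csv(
    "00-All_relevant.vcf.gz",
    sep="\t",
    compression="gzip",
    header=0,
    names=["ID", "CHROM", "POS", "REF", "ALT"],
    dtype={"ID": str, "CHROM": str, "POS": "int64", "REF": str, "ALT": str},
)

# Index by rsID so lookups are fast, then persist for later sessions.
df = df.set_index("ID")
df.to_pickle("rsid_to_location.pkl")  # file name is just an example
```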
However, this turns out to be quite a memory-intensive task, and I haven't even been able to finish it because of memory issues. I also worry that it could become a bottleneck for my tool. All I need is to get the genomic location of a variant after entering its rsID. Can any of you think of a more efficient way that doesn't require an online query?
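For reference, the kind of interface I'm after would be something like this (purely hypothetical names, assuming the pickled table from above and one row per rsID):

```python
import pandas as pd

# Load the pickled table once, then look up locations by rsID.
df = pd.read_pickle("rsid_to_location.pkl")

def get_location(rsid: str) -> tuple:
    """Return (chromosome, position) for a given rsID, e.g. 'rs334'."""
    row = df.loc[rsid]  # assumes the rsID appears in exactly one row
    return row["CHROM"], row["POS"]

print(get_location("rs334"))
```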