Question

VCF tabix query by ID column

1

Entering edit mode

7.2 years ago

liangzebin5566 ▴ 30

I have a bgzip vcf file and build tabix index already, but I am wondering if tabix can query VCF file by ID, just like:

tabix my.vcf.gz id='rs527352437'

or is there another way to query dbSNP by ID quickly?

tabix vcf • 5.7k views

ADD COMMENT • link updated 7.2 years ago by Alex Reynolds 36k • written 7.2 years ago by liangzebin5566 ▴ 30

0

Entering edit mode

Hello, did you find good solution? I need to find position for some SNP in dbSNP.vcf by rsID . I am currently using awk command, but its really slow, because dbSNP database is huge.

ADD REPLY • link 5.7 years ago by MatthewP ★ 1.4k

1

Entering edit mode

you can use snpsift or bcftools: MatthewP

java -jar /opt/snpEff/SnpSift.jar filter "(ID =~ 'rs34875155')" clinvar.vcf.gz
bcftools view -i '%ID=="rs5771206"' clinvar.vcf,gz

bgzip dbsnp vcf and index bgzipped file.

ADD REPLY • link 5.7 years ago by cpad0112 21k

0

Entering edit mode

7.2 years ago

cpad0112 21k

$ zgrep -w rs34875155 clinvar.vcf.gz

No need to use tabix for system function, IMHO if you want only records. If you want headers and do it with a vcf filtering tool, then you can use snpsift. Example script to filter a vcf by a single rs ID is furnished below:(output is not pasted here as header lines in clinvar are huge in number):

$ java -jar /opt/snpEff/SnpSift.jar filter "(ID =~ 'rs34875155')" clinvarvcf.gz

if you have multiple rsids in a file, follow example provided by snpsift manual.

ADD COMMENT • link 7.2 years ago by cpad0112 21k

score 5 · Accepted Answer · 2017-09-29

5

Entering edit mode

7.2 years ago

Alex Reynolds 36k

Here's an approach that uses hg38 SNP data from UCSC, an rs* ID of interest, and pts-line-bisect to do a binary search.

Your VCF-formatted data will be in different columns, but the principle here is the same — you permute the ID into the first column of a text file. You sort that text file on the first column. You do a binary search on the sorted text file.

Binary searches on sorted data are fast — log n fast — resulting in searches that take fractions of a second where they would otherwise take minutes or longer.

1) Get SNPs and write them into a text file sorted by SNP ID. For hg38, for instance:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $5,$2,$3,($3+1) }' \
   | sort -k1,1 \
   > hg38.snp147.sortedByName.txt

2) Install pts-line-bisect:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

3) Run a query:

$ rs_of_interest=rs2814778
$ ./pts-line-bisect/pts_lbsearch -p hg38.snp147.sortedByName.txt ${rs_of_interest} \
   | head -1 \
   | cut -f2- \
   > result.bed

In your case, you could modify your VCF file to permute the columns, so that the ID is in the first column of all columns. You permute and then sort as in step 1. This is the file you query with pts-line-bisect. Whatever result you get — if any — you permute back into VCF field ordering, likely using awk or similar.

With some modifications, it should be possible to do a binary search on a sorted, compressed text file. An important consideration is that the sort order is stable, so that compressed blocks provide a consistent or "deterministic" direction to go in finding the right compressed block for the data being queried.

ADD COMMENT • link 7.2 years ago by Alex Reynolds 36k

0

Entering edit mode

It can work, but not well. The time cost of pts-line-bisect is O(s * log n), where s is average length of each line, and n is line number, according to pts_line_bisect' page. If there's some specified vcf index tool just like tabix and can query by ID column, with time cost O(1).

ADD REPLY • link 7.2 years ago by liangzebin5566 ▴ 30

0

Entering edit mode

As s is a constant, a binary search is O(log n). Tabix is not constant time. If you want O(1), you'll need to read the SNP database into a hash table, using the SNP ID as the key, and the rest of the SNP record as the value, in the hash table's key-value pairing.

ADD REPLY • link 7.2 years ago by Alex Reynolds 36k

0

Entering edit mode

I use this approach for a web frontend, which translates the user's SNP ID of interest into a genomic position. The web frontend redirects the user to a custom data browser centered at that SNP's position within milliseconds. The SNP file in question — the same dbSNP build 147 dataset in my answer — is a 5.5 GB text file, uncompressed. Searches against this file take a fraction of a second. For this application, I'm pretty happy with that result.

ADD REPLY • link 7.2 years ago by Alex Reynolds 36k

0

Entering edit mode

You're right, I make a misunderstood of O(log n), it is quite small. Thank you very much!

ADD REPLY • link 7.2 years ago by liangzebin5566 ▴ 30