I have found a major bug in tabix that appears both in its command line and perl API contexts.
Tabix seems to retrieve more than what is queried for when dealing with single positions in a vcf file (for example dbSNP). The problems seems to occur if the neighboring positions are too close to the target position. Tabix cannot retrieve a single position if you supply a start position that is equal to the end position, so one must define a region small enough to include a only single position. The way I had been doing this is to supply a start position that is 0.5 less than my target position an an end point that is 0.5 more. This works most of the time, except for cases where a positions has neighbors within a few units of it.
Below are a few examples of what I am seeing. The grep command shows the target postion plus 6 lines of context in dbSNP.
The obvious solution would be to discard the results that don't match the target position. However, it does create some doubt about the reliability of tabix.
Has anybody else seen this? Could anybody explain why this occurs?
This example shows 1:900526's neighbor (900525) getting picked up
*Target: Chr 1 Pos 900526
$ `grep -P -A 3 -B 3 '1\t900526' 00-All.vcf`
1 900519 rs140019196 A G
1 900522 rs79133198 T C
1 900525 rs201438052 C
*1 900526 rs146302352 G A
1 900528 rs150067459 C T
1 900529 rs148779938 G A
1 900544 rs200240051 G A
1 900526 rs146302352 G A
$ `tabix 00-All.vcf.gz 1:900525.5-900526.5`
1 900525 rs201438052 C T
1 900526 rs146302352 G A
Perl: `$tabix_dbSNP->query(1, 900525.5, 900526.5)`
1 900526 rs146302352 G A
--------------------------------------------------------------
This case shows 1686296's neighbor (16862293) getting picked up.
*Target: Chr 1 Pos 16862296
$ grep -P -A 3 -B 3 '1\t16862296' 00-All.vcf`
1 16862273 rs76517011 C A
1 16862282 rs11260784 G A
1 16862293 rs71259516 GACG GACC,GGCG
*1 16862296 rs1759227 G C
1 16862340 rs200564364 G A
1 16862432 rs188066837 A G
1 16862454 rs71248115 T G
$ tabix 00-All.vcf.gz 1:16862295.5-16862296.5
1 16862293 rs71259516 GACG GACC,GGCG
1 16862296 rs1759227 G C
Perl: $tabix_dbSNP->query(1, 16862295.5, 16862296.5)
1 16862293 rs71259516 GACG GACC,GGCG
1 16862296 rs1759227 G C
This case works fine--returns on the target position; the neighbors are more than a few units away from the target position.
*Target: Chr 1 Pos 16952481
$ `grep -P -A 3 -B 3 '1\t16952481' 00-All.vcf`
1 16952371 rs57507062 C T
1 16952377 rs9658928 C T
1 16952382 rs744025 C T
*1 16952481 rs915311 A G
1 16952510 rs1765551 G A
1 16952565 rs71259460 TC TCCTC,TTCTT
1 16952577 rs11799496 C A
$ `tabix 00-All.vcf.gz 1:16952480.5-16952481.5`
1 16952481 rs915311 A G
Perl: `$tabix_dbSNP->query(1, 16952480.5, 16952481.5)`
1 16952481 rs915311 A G
Please try not to use words like "major bug" in the title unless you are 100% sure. It is very irritating, especially when you use floats for genomic coordinates.
I agree, this is poor form. I've edited the title to change to "possible bug" so as not to alarm too many people (like me when I saw the post).
My apologies. I won't make that mistake again.