I am confused by a subtle difference between cruzdb coordinates and UCSC coordinates. For those unfamiliar, cruzdb is a Python package for accessing the UCSC databases. Suppose I am interested in exon 3/7 of the gene TAF3. This exon extends from base pair 7963920 to 7965742 on chromosome 10. You can confirm this by zooming in on the two ends of the exon in the UCSC genome browser:
However, cruzdb appears to agree with the start of the exon but is off by one base at the stop:
from cruzdb import Genome
g = Genome('hg38')
# Case 1: only picks up the CDS, which appears to be the exon range according to cruzdb
chromosome = 'chr10'
start = 7963920
stop = 7965741
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print(features)
# Case 2: also picks up the intron on the left
chromosome = 'chr10'
start = 7963919
stop = 7965741
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print(features)
# Case 3: also picks up the intron on the right, even though this is the exon range according to UCSC
chromosome = 'chr10'
start = 7963920
stop = 7965742
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print(features)
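For context, one convention that may be relevant here: the UCSC Genome Browser displays positions as 1-based, fully closed intervals, while the underlying UCSC database tables (such as refGene) store 0-based, half-open intervals. The sketch below only illustrates that conversion for the exon above. The helper names are hypothetical and not part of cruzdb, and the conversion does not by itself explain how cruzdb treats the ends of a query interval.

```python
def browser_to_table(start_1based, end_1based):
    """Convert a 1-based, fully closed interval (UCSC browser display)
    to a 0-based, half-open interval (UCSC table storage)."""
    return start_1based - 1, end_1based


def table_to_browser(start_0based, end_0based):
    """Convert a 0-based, half-open interval back to 1-based, fully closed."""
    return start_0based + 1, end_0based


# Exon 3/7 of TAF3 as displayed in the browser (1-based, closed).
exon_browser = (7963920, 7965742)

# The same exon as it would be stored in a table (0-based, half-open).
exon_table = browser_to_table(*exon_browser)

# The exon length is the same in both conventions.
length_browser = exon_browser[1] - exon_browser[0] + 1
length_table = exon_table[1] - exon_table[0]

print(exon_table)
print(length_browser == length_table)
```

Under this convention the start shifts by one while the end keeps the same number, so a one-base disagreement at a single end of an interval is worth checking against the convention each tool assumes.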
I post this at the risk of looking really dumb, but I can't risk annotating hundreds of regions and being off by one base. Thanks!
Update: for anyone who uses cruzdb, you may want to look into this unresolved issue. I received this comment from the developer: