I am looking to improve the speed with which I find genomic sequence that flanks a given variant. Currently for a variant of interest, I use python to pull its chromosome and location, then query the UCSC genome browser for a given number of downstream and upstream bases like this:
check_output('wget -qO- http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=%s:%s,%s' % (chrom,low,high), stderr=STDOUT, shell=True)
This works just fine, especially when only working with a handful of variants. But when working with thousands of variants, this solution becomes quite slow because of the constant queries. So, is there a faster way of accomplishing the same task? I assume there is a more elegant way to use an offline copy of hg19 or something that would eliminate the need to constantly probe the genome browser?
This seems to be what others use, but it doesn't seem to be working for me. I do have
hg19.fa.fai
within the same directory ashg10.fa
, but runningsamtools faidx hg19.fa 1:6403804-6403874
just outputs>1:6403804-6403874
. Am I missing something?Check if you chromosomes' name start with 'chr' :
Beautiful,
samtools faidx hg19.fa chr1:6403804-6403874
works perfectly.