Question

Finding a variant's flanking sequence

7.0 years ago

L. A. Liggett ▴ 130

I am looking to speed up how I find the genomic sequence flanking a given variant. Currently, for a variant of interest, I use Python to pull its chromosome and position, then query the UCSC Genome Browser for a given number of upstream and downstream bases like this:

from subprocess import check_output, STDOUT
check_output("wget -qO- 'http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=%s:%s,%s'" % (chrom, low, high), stderr=STDOUT, shell=True)

This works fine for a handful of variants, but with thousands of variants it becomes quite slow because every variant triggers its own remote query. Is there a faster way to accomplish the same task? I assume there is a more elegant approach, such as using an offline copy of hg19, that would eliminate the need to query the genome browser constantly.

sequence sequencing • 1.8k views

ADD COMMENT • link updated 7.0 years ago by Chris Miller 22k • written 7.0 years ago by L. A. Liggett ▴ 130

score 2 · Accepted Answer · 2017-11-27

7.0 years ago

Chris Miller 22k

The command you're looking for is samtools faidx.

Example: samtools faidx hg19.fa 1:6403804-6403874
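For thousands of variants, the same command can be driven from Python in batches rather than once per variant. The following is a minimal sketch, not a definitive implementation. It assumes samtools is on the PATH, that variants are given as (chromosome, 1-based position) pairs, and that the chromosome names match those in the FASTA file. The names flank_region, parse_fasta, fetch_flanks and batch_size are hypothetical, chosen only for illustration.

```python
import subprocess


def flank_region(chrom, pos, flank):
    """Build a samtools region string spanning `flank` bases either side of a 1-based position."""
    start = max(1, pos - flank)  # never request a coordinate below 1
    end = pos + flank
    return f"{chrom}:{start}-{end}"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    seqs = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = []
        elif name is not None and line.strip():
            seqs[name].append(line.strip())
    return {key: "".join(parts) for key, parts in seqs.items()}


def fetch_flanks(fasta, variants, flank=35, batch_size=500):
    """Fetch flanking sequence for many variants, passing several regions per samtools call."""
    regions = [flank_region(chrom, pos, flank) for chrom, pos in variants]
    seqs = {}
    # Batch the regions so the command line stays a manageable length.
    for i in range(0, len(regions), batch_size):
        batch = regions[i:i + batch_size]
        result = subprocess.run(
            ["samtools", "faidx", fasta, *batch],
            check=True, capture_output=True, text=True,
        )
        seqs.update(parse_fasta(result.stdout))
    return seqs
```

With these assumptions, fetch_flanks("hg19.fa", [("chr1", 6403839)]) would request the region chr1:6403804-6403874. Results are keyed by region string, so duplicate regions collapse into one entry.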

ADD COMMENT • link 7.0 years ago by Chris Miller 22k

This seems to be what others use, but it doesn't seem to be working for me. I do have hg19.fa.fai within the same directory as hg19.fa, but running samtools faidx hg19.fa 1:6403804-6403874 just outputs >1:6403804-6403874. Am I missing something?

ADD REPLY • link 7.0 years ago by L. A. Liggett ▴ 130

Check whether your chromosome names start with 'chr':

head hg19.fa

ADD REPLY • link 7.0 years ago by michael.ante ★ 3.9k

Beautiful, samtools faidx hg19.fa chr1:6403804-6403874 works perfectly.

ADD REPLY • link 7.0 years ago by L. A. Liggett ▴ 130
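The naming mismatch diagnosed in this thread can also be checked programmatically before running a large batch. The sketch below reads the sequence names from the first tab-separated column of the .fai index and returns whichever spelling of a chromosome the reference uses. The function names fai_names and match_chrom are hypothetical, and the sketch only toggles a leading 'chr'; other naming differences (for example MT versus chrM) are not handled.

```python
def fai_names(fai_path):
    """Return the sequence names listed in the first column of a .fai index."""
    names = []
    with open(fai_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                names.append(line.split("\t")[0])
    return names


def match_chrom(chrom, names):
    """Return `chrom` as spelled in the reference (with or without 'chr'), or None if absent."""
    known = set(names)
    if chrom in known:
        return chrom
    # Try the alternative spelling: strip or add the 'chr' prefix.
    alt = chrom[3:] if chrom.startswith("chr") else "chr" + chrom
    return alt if alt in known else None
```

For example, match_chrom("1", fai_names("hg19.fa.fai")) would return "chr1" for a reference whose sequences are named with the 'chr' prefix, which is the situation in this thread.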