Question

Batch Fetching Fasta Sequences From Bed File

3

Entering edit mode

14.0 years ago

Carl ▴ 80

Hi,

I have a number of bed files, for which I would like to extract the corresponding fasta sequences. What is the best way to do this in batch (i.e. without using e.g. the UCSC table browser), since UCSC does not provide (to my knowledge) any web service.. any Bioperl solution ? Or some other web service ? If possible, I would like to avoid installing the genome locally (otherwise the bedtools would be a solution).

Thx!

fasta bed • 22k views

ADD COMMENT • link updated 7.9 years ago by Maximilian Haeussler ★ 1.7k • written 14.0 years ago by Carl ▴ 80

0

Entering edit mode

How many sequences are you talking? a dozen? thousands? millions? The solution may very well differ depending on your needs.

ADD REPLY • link 14.0 years ago by Chris Miller 22k

0

Entering edit mode

make sure there is a reason for avoiding to download the genome - I've never seen an analysis that did not need to be repeated many times over and that can quickly backfire and you'll end up with far slower solutions

ADD REPLY • link 14.0 years ago by Istvan Albert 102k

score 9 · Answer 1 · 2011-04-13

9

Entering edit mode

14.0 years ago

brentp 24k

You can do this with bedtools if you download the sequence data here's a shell script for the entire process if you're using hg19:

BED=your.bed
for chr in `seq 1 22` X Y
do
    wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz | gunzip -c >> hg19.fa
done

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

EDIT: argh. Somehow I missed that you explicitly said you didn't want to download the sequence data or bedtools would be a solution.

ADD COMMENT • link 14.0 years ago by brentp 24k

1

Entering edit mode

".../chr1.fa.gz ..." produced a fasta file with the chr1 sequence concatenated 24 times, I needed to put the variable name in place of the 1 ".../chr1$chr.fa.gz..." to get this to work:

BED=$1
for chr in `seq 1 22` X Y
do
    wget -O - -q http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr$chr.fa.gz | gunzip -c >> hg19.fa
done

fastaFromBed -fi hg19.fa -bed $BED -fo $BED.fasta

(NB - "BED=$1" - bedfile name is first argument passed to script)

ADD REPLY • link 8.4 years ago by rjactonspsfcf ▴ 180

score 4 · Answer 2 · 2011-04-13

4

Entering edit mode

14.0 years ago

lh3 33k

I have a solution for hg18/b36, but not for hg19/b37.

awk '{print $1":"($2+1)"-"$3}' in.bed | xargs samtools faidx ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/technical/reference/human_b36_male.fa.gz > out.fa

For razip compressed and faidx indexed fasta files, samtools is able to retrieve subsequences over network.

In addition, for a couple of regions, you can still use UCSC/Ensembl URL to retrieve sequences. For many regions, I believe Ensembl has APIs, though I have not used that.

ADD COMMENT • link 14.0 years ago by lh3 33k

0

Entering edit mode

so all that's required for hg19 is an razip'ed file on a public server, yes?

ADD REPLY • link 14.0 years ago by brentp 24k

0

Entering edit mode

And it is accompanied by a ".fai" file, the fasta index: two files on a public HTTP/FTP server.

ADD REPLY • link 14.0 years ago by lh3 33k

Ram · Answer 3 · 2011-04-13

4

Entering edit mode

14.0 years ago

Pierre Lindenbaum 165k

Yes, there is a web service (DAS) to fetch the genomic DNA sequence: e.g.:

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:10001000,10003000

see this other post for some solutions related to your problem: Perl To Retrieve Sequences From Ucsc Genome Browser

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.0 years ago by Pierre Lindenbaum 165k

Ram · Answer 4 · 2017-05-09

Use a command like this:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit -bed=input.bed test.fa

or for a single region:

twoBitToFa http://hgdownload.cse.ucsc.edu/gbdb/hg19/hg19.2bit test.fa -seq=chr21 -start=1 -end=10000

Requires the UCSC tool twoBitToFa, available from http://hgdownload.cse.ucsc.edu/admin/exe/

If you're not on the hg19 genome, you have to index your .fa file first with faToTwoBit