Downloading Fasta Files
4
3
14.1 years ago
Mcdenzlix ▴ 50

I need to download about 40 complete genomes from NCBI and then extract the sequence between specified base-pair positions (for example, between 1000 bp and 3000 bp) from each genome separately. I need help on how to do that. I would also like to BLAST some sequences against each of the downloaded genomes to check for the presence or absence of the queries.

Please assist or suggest the best approach.

fasta blast genome sequence • 8.9k views
5
14.1 years ago

Not tested, since you didn't post an example; use this only as a starting point:

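# placeholder URL: replace with the real location of the genome files
URL=http://www.ncbi.org/pub/genomes
# quote the list so both file names end up in the variable
GENOMELIST="E_coli.fa.gz E_coli_strain2.fa.gz"
INSEQFILE=myLocalFastaFileToBlast.fa
mkdir -p download filtered blast

for i in ${GENOMELIST}; do
  fa=${i%.gz}   # file name once gunzip has removed the .gz suffix
  wget ${URL}/$i -O download/$i
  gunzip download/$i
  # keep only sequences whose length is between 1000 and 3000 bp
  faFilter -minSize=1000 -maxSize=3000 download/$fa filtered/$fa
  # build a nucleotide BLAST database from the filtered sequences
  formatdb -i filtered/$fa -p F
  # search the query file against that database (-d was missing)
  blastall -p blastn -d filtered/$fa -i ${INSEQFILE} -o blast/$fa.blast -e 0.000001
done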

faFilter is from the UCSC source code collection, see http://genome.ucsc.edu/admin/jk-install.html or also http://genomewiki.ucsc.edu/index.php/The_source_tree
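
Note that faFilter's -minSize/-maxSize options select whole sequences by their length; they do not cut out a coordinate range. If what is wanted is bases 1000 to 3000 of each genome, a minimal sketch in plain Python could look like the following. The function names (read_fasta, extract_region, regions_from_fasta) are made up for illustration, and coordinates are assumed to be 1-based and inclusive:

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_region(seq: str, start: int, end: int) -> str:
    """Return bases start..end (1-based, inclusive), clipped to the sequence end."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    return seq[start - 1:end]


def regions_from_fasta(lines: Iterable[str], start: int, end: int) -> Dict[str, str]:
    """Map each FASTA header to the requested region of its sequence."""
    return {h: extract_region(s, start, end) for h, s in read_fasta(lines)}
```

Passing an open file handle for one of the unzipped genomes together with 1000 and 3000 would return one region per record in that file. Records shorter than the requested range come back truncated rather than raising an error, which may or may not be what you want.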

3
14.1 years ago
Lee Katz ★ 3.2k

Per usual, BioPerl has the answer.

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html

# you could make an array of IDs you need to fetch
use strict;
use warnings;
use Bio::DB::GenBank;
my $gb = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID
# subseq() takes 1-based, inclusive coordinates, so ranges start at 1
my @seqCoords = (
  [1, 100],
  [1000, 1100],
);
# fetch the first range; loop over @seqCoords to fetch them all
my $subseq = $seq->subseq($seqCoords[0][0], $seqCoords[0][1]);
# then, look at the blast modules and SearchIO to see how to start blasting and parsing
# http://www.bioperl.org/wiki/HOWTOs
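
For the presence/absence part of the question, one option that avoids a full report parser is to ask BLAST for tabular output (blastall -m 8) and check which queries have at least one hit below an E-value cutoff. The sketch below assumes the standard 12-column tabular layout, with the query ID in column 1 and the E-value in column 11; the function names and the default cutoff are illustrative only:

```python
from typing import Dict, Iterable, List, Set


def queries_with_hits(tabular_lines: Iterable[str], max_evalue: float = 1e-6) -> Set[str]:
    """Return the query IDs that have at least one hit with E-value <= max_evalue."""
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        # skip blank lines and the comment lines some output modes add
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular hit line
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits


def presence_table(all_queries: List[str],
                   hits_by_genome: Dict[str, Set[str]]) -> Dict[str, Dict[str, bool]]:
    """For every genome, record whether each query was found (True) or not (False)."""
    return {
        genome: {q: q in hits for q in all_queries}
        for genome, hits in hits_by_genome.items()
    }
```

Running queries_with_hits once per genome output file and collecting the results in a dictionary keyed by genome name gives presence_table everything it needs. Whether a single hit below the cutoff counts as "present" is a judgment call; you may also want to require a minimum alignment length or identity.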
2
14.1 years ago

You can download your genomes, build a BLAST database with formatdb, and then extract a second set of sequences using fastacmd:

ncbi/build/fastacmd has an option -L

  -L  Range of sequence to extract (Format: start,stop)
      0 in 'start' refers to the beginning of the sequence
      0 in 'stop' refers to the end of the sequence [String]  Optional
    default = 0,0

Then build a second database from the extracted sequences with formatdb and run your blastall query against it.
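
As a rough sketch of how the -L option might be driven for many genomes, the helper below only builds the fastacmd argument list; it does not run anything. It assumes that -d names the database and -s names the sequence ID to retrieve (check fastacmd's own help output), and that the database was built with ID indexing enabled in formatdb so lookups by ID work. The function name, database path and accession are hypothetical:

```python
from typing import List


def fastacmd_args(database: str, seq_id: str, start: int = 0, stop: int = 0) -> List[str]:
    """Build a fastacmd command line that extracts start..stop of one sequence.

    Following the -L help text, 0 for start means the beginning of the
    sequence and 0 for stop means its end.
    """
    if start < 0 or stop < 0:
        raise ValueError("start and stop must be >= 0")
    if stop and start > stop:
        raise ValueError("start must not be greater than stop")
    return ["fastacmd", "-d", database, "-s", seq_id, "-L", f"{start},{stop}"]
```

The returned list can be handed to subprocess.run, one call per genome, with the output redirected to a FASTA file that then becomes the input for the second formatdb run.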

0
14.1 years ago
Casbon ★ 3.3k

Might help: http://www.dcode.org/sequences.php
