Question

How to Blast with multiple species - Phylogenetic Analysis

0

Entering edit mode

4.9 years ago

dimitrischat ▴ 210

Hi all.

Need some advice on how to proceed with phylogenetics analysis. I got a fasta file with amino acid sequences from human. (There are about 270 sequences - this number isnt important - just mentioning it)

>chr1:228672725-228672991
tcgatctcttgatcttgtgatccacctgcctcagcctcccaaagtgctgggattacaggcgtgtgccaccacatccag...
>chr1:941749-942423
TAAGATGGGATCCAGCAGTGCGAGACTGTGGCCCAGGTCAGATGGTGGCAGCTCGGCCTTCCTGGT...
..
.

I would like to blast these sequences against different species, then multi align them and have a maf ouput in order to feed it to phast (http://compgen.cshl.edu/phast/). My questions are: 1) Can i blast all my sequences to species of my desire, and how? 2) Should i blast one by one of my sequences (of my choosing) to species of my desire, and how? 3) I am trying to find conservation scores between species and phylogenetic models (phastcons and phylofit), is my approach correct?

Any help is much needed. Thanks a lot.

P.S. If i didnt explain something correctly, its because this is something very new to me. Thank you for your understanding

ChIP-Seq alignment sequence • 2.2k views

ADD COMMENT • link 4.9 years ago by dimitrischat ▴ 210

1

Entering edit mode

1) Can i blast all my sequences to species of my desire,

By using the limit taxID option with blast+

 -taxids <String>
   Restrict search of database to include only the specified taxonomy IDs
   (multiple IDs delimited by ',')

ADD REPLY • link 4.9 years ago by GenoMax 150k

0

Entering edit mode

Thanks. But which database should i use?

ADD REPLY • link 4.8 years ago by dimitrischat ▴ 210

1

Entering edit mode

If you use a large database like nt or nr then you can limit the searches to any number of taxonomy ID's using the method above.

ADD REPLY • link 4.8 years ago by GenoMax 150k

0

Entering edit mode

i used this command that i found from an older post of yours, because i have amino acid sequences and i want to multi align them later on so i can create a phylogeny tree:

perl update_blastdb.pl --decompress nt

Now i got files from "nt.00.tar.gz + nt.00.tar.gz.md5" to "nt.22.tar.gz + nt.22.tar.gz.md5". Now what is the next step? Extract them (guessing into 1 file because makeblastdb requires as input a fasta file?) and then try ./makeblastdb ?

And then try

./blastn -query my.fasta -db nt.fasta -outfmt (which one? i need to get the aligned sequences only) -taxids xxxxx,yyyyy,zzzzz, > output

Sorry but i am stuck, i cant use ncbi online because cpu limit is reached

ADD REPLY • link 4.8 years ago by dimitrischat ▴ 210

1

Entering edit mode

The .md5 contain md5hash values that are used for checking the integrity of the downloaded nt files. You can leave them as is. Uncompress all the other nt.tar.gzfiles by usingtar -zxvf nt.tar.gz` (will take some time). Keep all the files that result in one directory.

outfmt 6 is generally used if you want to parse the file using programmatic means. See the description of the format here. You will need to include staxids in your command if you want to distinguish where the hits are.

ADD REPLY • link 4.8 years ago by GenoMax 150k

0

Entering edit mode

thanks again.

i tried:

./blastn -query '/my.fasta' -db '/bin/blast_nt_db' -out my.out -outfmt "6 qseqid sseqid pident qseq sseq length mismatch evalue staxids 4577;5833;6239;7215;7745;7955;10090;11320"

but produced this error:

BLAST Database error: No alias or index file found for nucleotide database [/home/app/Desktop/apps/ncbi-blast-2.10.0+/bin/blast_nt_db] in search path [/home/app/Desktop/apps/ncbi-blast-2.10.0+/bin::]

ADD REPLY • link 4.8 years ago by dimitrischat ▴ 210

1

Entering edit mode

If you put all the nt files in /bin/blast_nt_db directory then you need to supply basename of the blast index to -db command as -db /bin/blast_nt_db/nt.

Multiple taxID should be separate by , not ;.

ADD REPLY • link 4.8 years ago by GenoMax 150k

0

Entering edit mode

Thanks, i am trying to run the command now. Although when typing ./blastn -help, the option for staxids :

staxids means unique Subject Taxonomy ID(s), separated by a **';'** (in numerical order)

ADD REPLY • link 4.8 years ago by dimitrischat ▴ 210

1

Entering edit mode

I thought you wanted to restrict your search to specific taxID. That needs to be done on the command line as noted here : C: How to Blast with multiple species - Phylogenetic Analysis

staxid option is for -outfmt 6 format to display the taxID in the result.

ADD REPLY • link 4.8 years ago by GenoMax 150k

0

Entering edit mode

Yes i do. But on this post C: How to Blast with multiple species - Phylogenetic Analysis , you mentioned staxids, not staxid. Staxids need ";" for separation. For staxid -help doesnt mention anything. So i should use staxid with "," ? Thanks again

ADD REPLY • link 4.8 years ago by dimitrischat ▴ 210

1

Entering edit mode

staxids option is for formatting blast results that are being written to the result file. Without that you would not know which taxID the result belongs to.

Your command should be something like:

./blastn -query '/my.fasta' -db '/bin/blast_nt_db/nt' -out my.out -taxids 4577,5833,6239,7215,7745,7955,10090,11320 -outfmt "6 qseqid sseqid pident qseq sseq length mismatch evalue staxids"

ADD REPLY • link 4.8 years ago by GenoMax 150k

0

Entering edit mode

I ran the command three times, like so:

./blastn -query '/my.fasta' -db '/blast_nt_db/nt' -out my.out -num_threads 8 -outfmt "6 qseqid sseqid pident qseq sseq length mismatch evalue staxids 4577,5833,6239,7215,7745,7955,10090,11320"

./blastn -query '/my.fasta' -db '/blast_nt_db/nt' -out my1.out -num_threads 8 -outfmt "6 qseqid sseqid pident qseq sseq length mismatch evalue staxids 4577;5833;6239;7215;7745;7955;10090;11320"

./blastn -query '/my.fasta' -db '/blast_nt_db/nt' -out /my2.out -num_threads 8 -outfmt "6 qseqid sseqid pident qseq sseq length mismatch evalue staxid 4577,5833,6239,7215,7745,7955,10090,11320"

All same same size (2.5gb) and i think they are identical.

Now, how should i proceed selecting for each of my sequences, the 1-2 top hits of each taxonomy? Or should i proceed some other way ?

ADD REPLY • link 4.8 years ago by dimitrischat ▴ 210

1

Entering edit mode

Did you not read my last comment? This is not the way to run that blast command. Scroll to the right in the command I have in last comment.

ADD REPLY • link 4.8 years ago by GenoMax 150k

0

Entering edit mode

my bad, didnt see. Just ran it, works! Thanks

Now, in order to do the multi alignment and also create a phylogeny tree, i should pick the 1-2 top hits of each taxonomy of each of my sequences?

Or some other way?

Thanks a lot!!!

ADD REPLY • link 4.8 years ago by dimitrischat ▴ 210