Hi all.
I need some advice on how to proceed with a phylogenetic analysis. I have a FASTA file with amino acid sequences from human. (There are about 270 sequences. The number isn't important, I'm just mentioning it.)
>chr1:228672725-228672991
tcgatctcttgatcttgtgatccacctgcctcagcctcccaaagtgctgggattacaggcgtgtgccaccacatccag...
>chr1:941749-942423
TAAGATGGGATCCAGCAGTGCGAGACTGTGGCCCAGGTCAGATGGTGGCAGCTCGGCCTTCCTGGT...
..
.
I would like to blast these sequences against different species, then multiple-align them and produce MAF output to feed to PHAST (http://compgen.cshl.edu/phast/). My questions are:
1) Can I blast all my sequences to species of my desire, and how?
2) Should I blast my sequences one by one (those of my choosing) to species of my desire, and how?
3) I am trying to find conservation scores between species and phylogenetic models (phastCons and phyloFit). Is my approach correct?
Any help is much needed. Thanks a lot.
P.S. If I didn't explain something correctly, it's because this is all very new to me. Thank you for your understanding.
1) Can I blast all my sequences to species of my desire,
By using the limit taxID option with blast+.
Thanks. But which database should I use?
If you use a large database like nt or nr then you can limit the searches to any number of taxonomy IDs using the method above.
I used this command that I found in an older post of yours, because I have amino acid sequences and I want to multiple-align them later on so I can create a phylogenetic tree:
Now I have files from "nt.00.tar.gz + nt.00.tar.gz.md5" to "nt.22.tar.gz + nt.22.tar.gz.md5". What is the next step? Extract them (into one file, I'm guessing, because makeblastdb requires a FASTA file as input?) and then try ./makeblastdb?
And then try
Sorry, but I am stuck. I can't use NCBI online because the CPU limit is reached.
The .md5 files contain MD5 hash values that are used for checking the integrity of the downloaded nt files. You can leave them as is. Uncompress all the other nt.*.tar.gz files, one archive at a time, by using tar -zxvf (for example, tar -zxvf nt.00.tar.gz; this will take some time). Keep all the files that result in one directory.
outfmt 6 is generally used if you want to parse the file using programmatic means. See the description of the format here. You will need to include staxids in your command if you want to distinguish where the hits are.
Thanks again.
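To make the checksum and extraction steps concrete, here is a minimal Python sketch. It assumes each .md5 file holds the digest as its first whitespace-separated token (the usual md5sum layout) and that every archive sits next to its .md5 file. The function names and directory arguments are made up for illustration.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Assumption: the digest is the first token in the .md5 file."""
    return Path(md5_path).read_text().split()[0].lower()


def verify_and_extract(archive_dir, out_dir):
    """Check every *.tar.gz against its .md5, then unpack it into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        md5_file = Path(str(archive) + ".md5")
        if md5_of_file(archive) != expected_md5(md5_file):
            raise ValueError(f"checksum mismatch for {archive.name}")
        # All archives are unpacked into the same directory on purpose,
        # so the resulting database files end up side by side.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(out_dir)
        extracted.append(archive.name)
    return extracted
```

Calling verify_and_extract("downloads", "/bin/blast_nt_db") would then stop at the first corrupted download instead of silently unpacking it. Both directory names here are only examples.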
I tried:
but it produced this error:
If you put all the nt files in the /bin/blast_nt_db directory then you need to supply the basename of the blast index to the -db option, as -db /bin/blast_nt_db/nt.
Multiple taxIDs should be separated by , not ;.
Thanks, I am trying to run the command now. Although when typing ./blastn -help, the option for staxids:
I thought you wanted to restrict your search to specific taxIDs. That needs to be done on the command line as noted here: C: How to Blast with multiple species - Phylogenetic Analysis
The staxid option is for -outfmt 6 format to display the taxID in the result.
Yes, I do. But in this post, C: How to Blast with multiple species - Phylogenetic Analysis, you mentioned staxids, not staxid. staxids needs ";" for separation. For staxid, -help doesn't mention anything. So should I use staxid with ","? Thanks again.
The staxids option is for formatting blast results that are being written to the result file. Without that you would not know which taxID the result belongs to.
Your command should be something like:
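As a hedged illustration (not a verbatim command from this thread), the points above can be put together in a small Python helper that only builds the argument list. It assumes a BLAST+ release and database version that support the -taxids option. The query file, output file, and the two taxIDs (9598 for chimpanzee, 10090 for mouse) are example values, and build_blastn_command is a made-up name.

```python
import shlex


def build_blastn_command(query, db_basename, taxids, out_path):
    """Assemble a blastn argument list restricted to the given taxonomy IDs."""
    return [
        "blastn",
        "-query", query,
        # Basename of the index files, not the directory that holds them.
        "-db", db_basename,
        # Search restriction: IDs joined with ',' (not ';').
        "-taxids", ",".join(str(t) for t in taxids),
        # Output formatting: standard tabular columns plus a taxID column.
        "-outfmt", "6 std staxids",
        "-out", out_path,
    ]


# Example values only; replace them with your own files and species.
cmd = build_blastn_command(
    "query.fasta", "/bin/blast_nt_db/nt", [9598, 10090], "hits.tsv"
)
print(shlex.join(cmd))  # a copy-pasteable shell command
```

The two taxonomy-related pieces do different jobs: -taxids limits what is searched, while staxids in -outfmt only adds a column to the results, where several IDs for one subject are separated by ";".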
I ran the command three times, like so:
All the same size (2.5 GB) and I think they are identical.
Now, how should I proceed with selecting, for each of my sequences, the 1-2 top hits of each taxonomy? Or should I proceed some other way?
Did you not read my last comment? This is not the way to run that blast command. Scroll to the right in the command I have in last comment.
My bad, didn't see. Just ran it, works! Thanks.
Now, in order to do the multiple alignment and also create a phylogenetic tree, should I pick the 1-2 top hits of each taxonomy for each of my sequences?
Or some other way?
Thanks a lot!!!
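If you do go with the "top 1-2 hits per taxonomy" idea, one possible way to pull them out of the tabular results is sketched below. It assumes the output was written with -outfmt "6 std staxids" (12 standard columns with the bit score in column 12, followed by the taxID column) and ranks hits by bit score. Whether this selection suits the downstream alignment is a separate question that the sketch does not settle. The function name is made up.

```python
import csv
from collections import defaultdict


def top_hits_per_taxid(lines, per_taxid=2):
    """Keep the best-scoring subjects for every (query, taxID) pair."""
    grouped = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 13:
            continue  # skip blank or malformed lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        # staxids can list several IDs joined by ';'. Count the hit for each.
        for taxid in row[12].split(";"):
            grouped[(query, taxid)].append((bitscore, subject))

    best = {}
    for key, hits in grouped.items():
        seen, kept = set(), []
        # Highest bit score first. A subject with several HSPs is kept once.
        for bitscore, subject in sorted(hits, key=lambda h: -h[0]):
            if subject not in seen:
                seen.add(subject)
                kept.append((subject, bitscore))
            if len(kept) == per_taxid:
                break
        best[key] = kept
    return best
```

With a results file this could be called as top_hits_per_taxid(open("hits.tsv")), where hits.tsv is an example file name. The result maps each (query, taxID) pair to a short list of (subject, bit score) tuples.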