Hi:
I am running some blastx. I am yet not 100% sure how to explain the results. For example: suppose, the query seq is 5000 nt long. It has 3 ORF (also suppose the AA lengths of the ORFs are 300, 400, and 500) in different frames. Now, how does blastx report the best hit from all these different ORF?
I am sort of intrigued because I am running blastx with these options (to get the best hit in output format 6):
-max_hsps 1 -max_target_seqs 1
So, what might be happening that I am loosing some data. Probably I am getting the best hit (based on bit-score) from only one ORF, and loosing the others.
Am I thinking it right (the result part)? Any idea how to break down the result for the individual ORF?
Thanks!!!
Thank you very much! Is there a way to pipe the sequences of the ORFs (of the query) that blastx uses to output? Also, do you know what is the minimum size of the ORF that blastx uses?
Thanks a ton!!
Yes there is actually.
If you are using Blast+ versions you can ask for tabular output eg.
-outfmt 6
, for which you can specify the fields to be outputted such asqseq
. Have a look at the blast helpblastn -help
for more details on this.Keep in mind that it will output the alignable part of the sequence which is not necessarily the same as the biological ORF .
In regard to your min length question : in theory the min length is the word-size , in practice it will be much longer because you will need much larger aligned parts to meet other filtering criteria (such as E-value) and as such it's not actually possible. to put a number on it.
In general I would (in most cases) not advise to use blast to look for ORFs in a sequence. It's good for giving you an indication but it's not capable of providing correct ORFs.
Thank you very much! I have used another program to get all the ORFs. Is have solved my concerns.
Now, I need something similar to qseq. Can I get the entire sequence of the subject? I want the fasta sequences of the top 5 hits for each of my queries. I do not see any options fro that on blast help page.
Thanks again.
No, that you can't get directly from the blast output file.
However, getting those is not much more difficult. if you can get the IDs from the top5 hits, you can use the blast utility
blastdbcmd
which will extract the full sequences fro those iDs from the blastDBYa!!!!! Thanks. Great that I can run command line to get all the sequences. I actually need to write a script with a loop to get sequences against my individual queries. Thanks!!
I have another question:
I want to run blastp against multiple db, and output 5 best hits (-max_target_seqs 5) from each of them (each db hits) . Suppose, I am running blastp against 5 db; I want all 25 hits into one output file.
I see that people use blastdb_aliastool. I just want your opinion. What is the best way?
Thanks a ton, like always!!
Yes, you can use blastdb_alias tool (or even simply specify all the DBs on the cmdline
-db db1 db2 db3
).However with the
-max_target_seqs 5
you can NOT specify that there will be 5 matches from each DB, you will only get the 5 matches over all DBs (the DBs are then considered as a single DB).If you want to have control per DB you will have to run the blast once for each DB and concatenate the results (tab blastoutput you can simply cat together)
Yes, it works! However, my problem has an extra step. I have multiple queries (1067 to be exact). I want to run them in a loop. Like this:
Seems to be not working ( I have pretty low level of bash scripting knowledge too!)
Not working as in you get nothing or are there errors?
To debug your code use an
echo
command. Look techo $i
andecho $j
to make sure those file names are coorectly expanded. Then add anecho
beforediamond
andcat
in code above (this will only cause the loop to print command lines) to see what command lines are being produced. This will allow you to find out what is wrong with your loop.Thanks. I think I am putting the two variables "i", and "j" in a wrong way. I have added echo in my commands:
No output files. And, got this in terminal.
*diamond blastp -p 30 -v --outfmt 6 --db echo db1.dmnd --query echo seq_aaab -e 5 -o seq_aaab.txt -k 5 --sensitive
diamond blastp -p 30 -v --outfmt 6 --db echo db2.dmnd --query echo seq_aaab -e 5 -o seq_aaab.txt -k 5 --sensitive
diamond blastp -p 30 -v --outfmt 6 --db echo db1.dmnd --query echo seq_aaac -e 5 -o seq_aaac.txt -k 5 --sensitive
diamond blastp -p 30 -v --outfmt 6 --db echo db2.dmnd --query echo seq_aaac -e 5 -o seq_aaac.txt -k 5 --sensitive
diamond blastp -p 30 -v --outfmt 6 --db echo db1.dmnd --query echo seq_aaad -e 5 -o seq_aaad.txt -k 5 --sensitive
diamond blastp -p 30 -v --outfmt 6 --db echo db2.dmnd --query echo seq_aaad -e 5 -o seq_aaad.txt -k 5 --sensitive
....
.......*
That is to be expected. What genomax suggested was to add the
echo
command for debugging your script, if you now want to really run it, you have to remove theecho
commands again. As you can see it now simply prints the commands it would have executed. Otherwise your script looks OK I think.I notice you now switched to diamond (in stead of 'normal' blast), any reason for that?
For the
-e
parameter: I think you need to add it as-e 1e-5
if you want to set an e-value cutoff of 1e-5 (now you have set it to 5 )Thanks. I got the echo part. I have been using diamond blastp (my db files are huge). However, I can now switch to regular blastp.
Thanks a ton for pointing out the e-value cut-off.
But, when I run the above loop, the output files are empty.
ok.
are all possible output files empty (so also the intermediate ones) or are there that are not empty?
a first problem I see with the loop is that you output all blast results from the different DBs to the same file, so in the best case you will only get the result of the last blast in the loop. I suggest to output to a file with $i and $j combination (eg.
${i}_${j}.txt
) , I would also omit the use of the " (should work without as well).your final
cat
is also not correct. now you create one output file for each of the input seq combining all the DB results. if that is what you want, then it's OK.If you want to combine all input files over all DBs you need to place the
cat
statement outside the loops. Moreover: that part will never work as you output the cat to a file called All.txt , theawk
part behind it will not do anything. If you want the awk statement to work you'll need to go for this:also note I added a $ to the file
I.Name.txt
Wow! Great. Thanks a ton. Almost there!!
I only know pieces about bash. So, confuse things.....
What I am using now is:
It's working the way I wanted except the cat part. I am probably putting it in a wrong place. What I want is: cat all the .txt output (from each of the db) for each of my query separately. So that I would get the top 5 hits in separate files for each of the query sequence from all the databases.
Can You help me with that?
Thanks!!
the problem there is that for each iteration in your loop the
*.txt
part will match all files txt files that are present in the current working folder. To avoid that I propose to do something like this:this will limit cat to only use files of each input seq and not all txt files. This will result in a file per input seq file with all the blast results against all DBs.
Thank you so much!!!!!!!!!! It's now working!!