I am running tBLASTn with a ~370 MB genome as the database and around 3,000 protein sequences as the query. BLAST takes a long time to produce results (10-11 hours). Please let me know if there is any option to speed up BLAST.
Thanks,
"Does not work" how? You should explain better so we don't need to guess what is wrong / missing.
I will guess you got 100% identity, but not 100% coverage, and I will suggest you add "-qcov_hsp_perc".
How many threads are you using (-num_threads)? Maybe you can increase this so it can compute more at the same time.
Another option, if you have multiple servers, computers, or nodes, is to split your protein FASTA into multiple files and either run a multi-threaded tblastn against your 370 MB genome database locally on each machine
or, if you have access to a cluster, send both the database and the split FASTA files to different nodes and start tblastn on each of them simultaneously.
If you need code for the node part, just ask; I can produce a script that does that :D
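The splitting step can be sketched in Python roughly as follows. This is only a minimal sketch: the function names `read_fasta` and `split_fasta` and the `query_part` output prefix are made up for illustration, and it assumes the whole query file fits in memory (fine for about 3,000 proteins).

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, n_parts, out_prefix="query_part"):
    """Distribute records round-robin over n_parts files; return their paths."""
    records = list(read_fasta(path))
    out_paths = []
    for i in range(n_parts):
        part = records[i::n_parts]  # every n-th record, starting at record i
        if not part:
            continue  # fewer records than parts: skip empty files
        out_path = f"{out_prefix}_{i + 1}.fasta"
        with open(out_path, "w") as out:
            for header, seq in part:
                out.write(f"{header}\n{seq}\n")
        out_paths.append(out_path)
    return out_paths
```

Each resulting file can then be used as the query of its own tblastn run on a separate machine or node, and the tabular outputs concatenated afterwards.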
You are using an i7, so you should have 8 threads available to you. You could set -num_threads to 6, which lets BLAST work on the alignments in parallel instead of one after the other. This could speed up the whole process by up to about 6-fold, although scaling is rarely perfectly linear.
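As a rough illustration, a call with six threads could be assembled like this. The file names `proteins.fasta`, `genome_db`, and `hits.tsv` are placeholders, and the helper `build_tblastn_command` is hypothetical; only the tblastn flags themselves (`-query`, `-db`, `-out`, `-outfmt`, `-num_threads`) are standard BLAST+ options.

```python
def build_tblastn_command(query, db, out, threads=6):
    """Return a tblastn argument list with tabular output and N threads."""
    return [
        "tblastn",
        "-query", query,              # protein FASTA file
        "-db", db,                    # nucleotide database made with makeblastdb
        "-out", out,                  # where the hits are written
        "-outfmt", "6",               # tabular output, one hit per line
        "-num_threads", str(threads),
    ]


# Placeholder file names; adjust to your own data.
cmd = build_tblastn_command("proteins.fasta", "genome_db", "hits.tsv", threads=6)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and on the PATH.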
For your question about removing duplicate query sequences: you can create a BLAST database from your query and blast the query against its own database using global alignment (local won't work for this). Then take the matches that have different headers but still have a 100% identity match and remove one of the two from your dataset.
You can also write a script that reads every sequence, stores it in a list, then for each following sequence checks whether it is already in the list; if not, it appends it, and finally it prints out the headers and sequences that are in the list.
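A minimal Python sketch of that second approach is below. It swaps the list for a set, since membership checks are faster that way, and it treats only exactly identical sequences (ignoring case) as duplicates. The function names are made up for illustration.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicate_sequences(in_path, out_path):
    """Write the first occurrence of each distinct sequence; return how many were kept."""
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            key = seq.upper()  # compare case-insensitively
            if key in seen:
                continue  # same sequence already written under another header
            seen.add(key)
            out.write(f"{header}\n{seq}\n")
            kept += 1
    return kept
```

Only the header of the first copy is kept, so check the result if you need to know which headers were dropped.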
For your identity match (I understand that you want only the 100% identity hits between your genome and query), you can load a tabular BLAST output into Excel and set a filter on the column that holds the percent identity.
You can also write a small script that reads each line of the BLAST output and stores only the lines with 100% in the identity column in a new file.
If you know how to work with R, I also have a script that does this if you want. I can publish it somewhere. It takes four user inputs: the first is the file location, the second is the minimum identity you want, the third is the maximum number of mismatches, and the fourth is the maximum number of gaps. It writes a file called output that contains only the BLAST hits that pass your arguments.
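For anyone who would rather not use R, the same kind of filter can be sketched in Python. This is not the R script mentioned above, just an illustration under a few assumptions: the BLAST output is in the default tabular format (-outfmt 6), where column 3 is percent identity, column 5 is mismatches, and column 6 is gap openings. Note that column 6 counts gap openings, not total gap length. The function name is made up.

```python
def filter_blast_hits(in_path, out_path, min_identity=100.0,
                      max_mismatches=0, max_gaps=0):
    """Copy the hits that pass all three thresholds; return how many were kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and -outfmt 7 style comments
            fields = line.rstrip("\n").split("\t")
            pident = float(fields[2])     # column 3: percent identity
            mismatches = int(fields[4])   # column 5: number of mismatches
            gap_opens = int(fields[5])    # column 6: number of gap openings
            if (pident >= min_identity
                    and mismatches <= max_mismatches
                    and gap_opens <= max_gaps):
                dst.write("\t".join(fields) + "\n")
                kept += 1
    return kept
```

With the defaults it keeps only hits with 100% identity and no mismatches or gap openings; loosen the thresholds for a less strict filter.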
Grtz!
Thanks for the reply. About the two scripts you suggested: 1. a script that reads every sequence, stores it in a list, then checks for each following sequence whether it is already in the list; 2. a script that reads each line of the BLAST output and stores only the lines with a 100% score. Do you have scripts like these in Perl or PHP? If you do, please post them in this thread. I am a little new to R.
Thanks
You edited your question in such a way that my answer to your previous question now makes no sense. If you have a new question, open a new post; do not edit an old post.
3,000 proteins against 370 genomes is supposed to take a long time. What do you hope to achieve with the result?
Sorry, that was a typing slip; it is a ~370 MB genome against 3,000 protein sequences.
And by 3,000 proteins do you mean a single sequence of 3000 amino acids? Or thirty thousand perhaps? What kind of creature has the 370MB genome? That's quite small and maybe it has difficult regions.
The query file contains three thousand protein sequences, and the 370 MB genome contains nucleotide sequence.