Question

tBLASTn at standalone

9.5 years ago

Kumar ▴ 170

I am running tBLASTn with a ~370 MB genome as the database and around 3,000 protein sequences as the query. BLAST takes a long time (10-11 hours) to produce results. Please let me know if there is any option to accelerate BLAST.

Thanks,

blast genome • 3.6k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by Kumar ▴ 170

You edited your question in such a way that my answer to your previous question now makes no sense. If you have a new question, open a new post; do not edit an old one.

ADD REPLY • link 9.5 years ago by h.mon 35k

3,000 proteins against 370 genomes is supposed to take a long time. What do you hope to achieve with the result?

ADD REPLY • link 9.5 years ago by karl.stamm 4.1k

Sorry, that was a typo: it is a ~370 MB genome against 3,000 protein sequences.

ADD REPLY • link 9.5 years ago by Kumar ▴ 170

And by 3,000 proteins do you mean a single sequence of 3000 amino acids? Or thirty thousand perhaps? What kind of creature has the 370MB genome? That's quite small and maybe it has difficult regions.

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.5 years ago by karl.stamm 4.1k

The query file contains three thousand protein sequences, and the 370 Mb genome contains nucleotide sequence.

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.5 years ago by Kumar ▴ 170

score 0 · Answer 1 · 2015-06-07

9.5 years ago

h.mon 35k

"Does not work" how? You should explain better so we don't need to guess what is wrong / missing.

I will guess you got 100% identity, but not 100% coverage, and I will suggest you add "-qcov_hsp_perc".

ADD COMMENT • link 9.5 years ago by h.mon 35k

Thanks for the reply. I used the following command: "tblastn -query <path> -db <path> -out <path> -outfmt 6 -evalue 1e-60 -qcov_hsp_perc 100", but the output contains hits with 70-100% identity. I need the results to contain only hits with 100% identity and 100% coverage.

ADD REPLY • link 9.5 years ago by Kumar ▴ 170

Perhaps you need both "-qcov_hsp_perc" and "-perc_identity"?

ADD REPLY • link 9.5 years ago by h.mon 35k

OK. Do you have any idea how to remove duplicate hits for a query sequence from the result file of command-line BLAST?

ADD REPLY • link 9.5 years ago by Kumar ▴ 170

Ram · Answer 2 · 2015-06-26

9.5 years ago

Lesley Sitter ▴ 610

How many threads are you using (-num_threads)? Maybe you can increase this so that BLAST can compute more at the same time.

Another option, if you have multiple servers/computers/nodes, is to split your protein FASTA into multiple files. You can then either run a multi-threaded tblastn against your 370 MB genome database locally on each computer/server or, if you have access to a cluster system, send both the database and the split FASTA files to different nodes and start tblastn on each of them simultaneously.

If you need code for the node part, just ask; I can produce a script that does that :D
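For the splitting step, here is a minimal Python sketch. It is only an illustration: the function names and the `query_part_N.fasta` output names are my own assumptions, and it simply deals the records round-robin into a chosen number of files so each part gets a similar number of proteins.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, n_parts, out_dir="."):
    """Deal records round-robin into up to n_parts files; return their paths."""
    records = list(read_fasta(path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n_parts):
        part = records[i::n_parts]  # every n_parts-th record, starting at i
        if not part:
            continue  # fewer records than parts: skip empty files
        # Hypothetical naming scheme, chosen only for this example.
        out_path = out_dir / f"query_part_{i + 1}.fasta"
        with open(out_path, "w") as out:
            for header, seq in part:
                out.write(f">{header}\n{seq}\n")
        paths.append(out_path)
    return paths
```

Calling `split_fasta("proteins.fasta", 4, out_dir="parts")` (file and directory names hypothetical) would write up to four files, and each one can then be used as the `-query` of a separate tblastn run.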

ADD COMMENT • link updated 2.0 years ago by Ram 44k • written 9.5 years ago by Lesley Sitter ▴ 610

I have a single computer with 32 GB RAM and an Intel(R) Core i7.

thanks

ADD REPLY • link 9.5 years ago by Kumar ▴ 170

You are using an i7, so you should have 8 threads available. You could set -num_threads to 6, which should let BLAST perform the alignments in parallel instead of one after the other. This could speed up the whole process by up to about 6-fold, although scaling is rarely perfectly linear in practice.
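As a rough sketch of the threaded run, the Python below builds the tblastn call with `-num_threads` and runs it. The helper names and the "all but two logical CPUs" default are my own assumptions; the flags themselves (`-query`, `-db`, `-out`, `-outfmt`, `-num_threads`) are the usual BLAST+ ones.

```python
import os
import subprocess


def build_tblastn_cmd(query, db, out, threads=None):
    """Return the tblastn argument list for a tabular (outfmt 6) run."""
    if threads is None:
        # Assumption: leave two logical CPUs free for the rest of the system.
        threads = max(1, (os.cpu_count() or 1) - 2)
    return [
        "tblastn",
        "-query", str(query),
        "-db", str(db),
        "-out", str(out),
        "-outfmt", "6",
        "-num_threads", str(threads),
    ]


def run_tblastn(query, db, out, threads=None):
    """Run tblastn; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_tblastn_cmd(query, db, out, threads), check=True)
```

For example, `run_tblastn("proteins.fasta", "genome_db", "hits.tsv", threads=6)` (all three names hypothetical) corresponds to the 6-thread setting suggested above and needs tblastn on your PATH.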

Regarding your question about removing duplicate query sequences: you can create a BLAST database from your query and BLAST the query against its own database using global alignment (local won't work for this). Then take the matches that have different headers but still have a 100% identity match, and remove one of the two from your dataset.
Alternatively, you can write a script that reads every sequence and stores it in a list; for each following sequence it checks whether that sequence is already in the list and, if not, appends it. Finally, it prints out the headers and sequences that are in the list.
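A minimal Python sketch of the script approach could look like this. The function names are made up for the example, it compares sequences case-insensitively (an assumption), and it uses a set for the "already seen" check instead of a plain list, which is a substitution on my part to keep lookups fast.

```python
def parse_fasta_lines(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_records(records):
    """Keep the first record for each distinct sequence, preserving order."""
    seen = set()
    kept = []
    for header, seq in records:
        key = seq.upper()  # assumption: case differences do not matter
        if key not in seen:
            seen.add(key)
            kept.append((header, seq))
    return kept
```

To use it on a file, pass the open file handle to `parse_fasta_lines`, run the result through `unique_records`, and write each kept pair back out as `>header` followed by the sequence.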

For your identity match (I understand that you want only the 100% identity hits between your genome and query), you can either load the tabular BLAST output into Excel and set a filter on the column that holds the identity match,
or write a small script that reads each line of the BLAST output and stores only the lines with a 100% value in the identity column in a new file.
If you know how to work with R, I also have a script that does this if you want. I can publish it somewhere. It takes four user inputs: the first is the file location, the second the minimal identity match you want, the third the max number of mismatches, and the fourth the max number of gaps. It will output a file called output that contains only the BLAST hits that pass your arguments.
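This is not the R script mentioned above, just a small Python sketch of the same idea. It assumes the default `-outfmt 6` column order, where column 3 is percent identity, column 5 the number of mismatches, and column 6 the number of gap openings (so the "gaps" limit here counts gap openings, not gap length). The function name and defaults are my own.

```python
def filter_hits(lines, min_identity=100.0, max_mismatches=0, max_gaps=0):
    """Yield tabular BLAST lines that pass the identity, mismatch and gap limits."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        pident = float(fields[2])    # column 3: percent identity
        mismatches = int(fields[4])  # column 5: number of mismatches
        gapopen = int(fields[5])     # column 6: number of gap openings
        if (pident >= min_identity
                and mismatches <= max_mismatches
                and gapopen <= max_gaps):
            yield line
```

To write the surviving hits to a new file, open the BLAST result for reading and a second file for writing, then pass `filter_hits(infile)` to the output file's `writelines`.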

Grtz!

ADD REPLY • link 9.5 years ago by Lesley Sitter ▴ 610

Thanks for the reply. You suggested two scripts: 1. A script that reads every sequence, stores it in a list, and then checks for each following sequence whether it is already in the list. 2. A script that reads each line of the BLAST output and stores only the lines with a 100% score. Do you have scripts like these in Perl or PHP? If you do, please publish them in this post. I am a little new to R.

Thanks

ADD REPLY • link updated 2.0 years ago by Ram 44k • written 9.5 years ago by Kumar ▴ 170