each protein with each protein
1
0
Entering edit mode
10.5 years ago
natasha.sernova ★ 4.0k

Dear colleagues,

I have a lot of proteins. What is the

simplest way to compare each one

of them with all the other proteins?

Probably I have to make a database

and use a program like BlastP.

But how to make sure that I've checked

all the protein pairs I have?

I feel that I have to use "for" in Linux,

but I've failed to finish the cycle expression.

Please. help me!

Many thanks!

N.

blast • 2.6k views
ADD COMMENT
0
Entering edit mode

You will need to learn how to compare one against one before attempting all vs all. Yes using "for" in linux will be important, but what do you hope to achieve? What kind of comparison? I feel this task is too large, all vs all is not going to be meaningful.

ADD REPLY
2
Entering edit mode
10.5 years ago
5heikki 11k

The simplest:

blastp -query all.fasta -subject all.fasta

Better:

makeblastdb -in all.fasta -dbtype prot -title all -out all -hash_index
blastp -query all.fasta -db all -out all-vs-all -outfmt 6

Better still:

makeblastdb -in all.fasta -dbtype prot -title all -out all -hash_index
blastp -query all.fasta -db all -out all-vs-all -outfmt 6 -use_sw_tback -seg yes -soft_masking true -max_target_seqs 20000 -num_threads "MAXonYourSystem"
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
for i in {4..6}
do
    awk '$11 <= 1e-"$i"' all-vs-all > all-vs-all.1e-"$i"
done

And all the other stuff like e.g. sorting for best hits, but really, your question was way too vague to say what should be done. E.g., if the seqs are from NCBI, you could include taxid_map in makeblastdb arguments and make use of that in blastp..

ADD COMMENT
0
Entering edit mode

Dear 5heikki,

That's really great, thank you very much! My goal is to find a correct way to search for the best hit.

It turned out to be rather complicated. "Sorting for best hit" - could you, please, explain to me, how to do it correctly!

My seqs are from NCBI. Honestly, I don't know how to do this second part: "you could include taxid_map in makeblastdb arguments and make use of that in blastp". It would be extremely useful to me to learn this. Please-please-please, give me more details! Many-many thanks!

ADD REPLY
0
Entering edit mode

Dear 5heikki,

That's really great, thank you very much! My goal is to find a correct way to search for the best hit.

It turned out to be rather complicated. "Sorting for best hit" - could you, please, explain to me, how to do it correctly!

My seqs are from NCBI. Honestly, I don't know how to do this second part: "you could include taxid_map in makeblastdb arguments and make use of that in blastp". It would be extremely useful to me to learn this. Please-please-please, give me more details!

Many-many thanks!

P.S. I'm not sure where I have to place my answer, so I've just repeat it here.

ADD REPLY
0
Entering edit mode

This sorts tabular blast output (-outfmt 6) for best hit based on 1) bitscore, 2) evalue, 3) % identity:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
sort -k1,1 -k12,12gr -k11,11g -k3,3gr blastOut.tsv | sort -u -k1,1 --merge > bestHits

But before that you might have to exclude self hits, something like:

awk '$1 != $2' blastOut.tsv > blastOut.tsv.noSelfHits

As to taxid_map, this file links all protein GIs to NCBI taxon IDs, so you can use it with the taxid_map argument. But then you also need to use the '-parse_seqids' flag in makeblastdb. Then you also need NCBI tax_db in your db dir. I believe all this is documented and explained further in blast manual. Anyway, then when you have this stuff working you can do e.g. -outfmt '6 std sscinames' to have 13 column output with species name as the last column..

ADD REPLY
0
Entering edit mode

5heikki <http: www.biostars.org="" u="" 9490=""/>

2014-06-14 16:56 GMT+04:00 5heikki on Biostar <notifications@biostars.org>:

ADD REPLY

Login before adding your answer.

Traffic: 2322 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6