Question

each protein with each protein

0

Entering edit mode

10.9 years ago

natasha.sernova ★ 4.0k

Dear colleagues,

I have a lot of proteins. What is the

simplest way to compare each one

of them with all the other proteins?

Probably I have to make a database

and use a program like BlastP.

But how to make sure that I've checked

all the protein pairs I have?

I feel that I have to use "for" in Linux,

but I've failed to finish the cycle expression.

Please. help me!

Many thanks!

N.

blast • 2.8k views

ADD COMMENT • link updated 10.9 years ago by 5heikki 11k • written 10.9 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

You will need to learn how to compare one against one before attempting all vs all. Yes using "for" in linux will be important, but what do you hope to achieve? What kind of comparison? I feel this task is too large, all vs all is not going to be meaningful.

ADD REPLY • link 10.9 years ago by karl.stamm 4.1k

Ram · Answer 1 · 2014-06-14

2

Entering edit mode

10.9 years ago

5heikki 11k

The simplest:

blastp -query all.fasta -subject all.fasta

Better:

makeblastdb -in all.fasta -dbtype prot -title all -out all -hash_index
blastp -query all.fasta -db all -out all-vs-all -outfmt 6

Better still:

makeblastdb -in all.fasta -dbtype prot -title all -out all -hash_index
blastp -query all.fasta -db all -out all-vs-all -outfmt 6 -use_sw_tback -seg yes -soft_masking true -max_target_seqs 20000 -num_threads "MAXonYourSystem"
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
for i in {4..6}
do
    awk '$11 <= 1e-"$i"' all-vs-all > all-vs-all.1e-"$i"
done

And all the other stuff like e.g. sorting for best hits, but really, your question was way too vague to say what should be done. E.g., if the seqs are from NCBI, you could include taxid_map in makeblastdb arguments and make use of that in blastp..

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by 5heikki 11k

0

Entering edit mode

Dear 5heikki,

That's really great, thank you very much! My goal is to find a correct way to search for the best hit.

It turned out to be rather complicated. "Sorting for best hit" - could you, please, explain to me, how to do it correctly!

My seqs are from NCBI. Honestly, I don't know how to do this second part: "you could include taxid_map in makeblastdb arguments and make use of that in blastp". It would be extremely useful to me to learn this. Please-please-please, give me more details! Many-many thanks!

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

Dear 5heikki,

That's really great, thank you very much! My goal is to find a correct way to search for the best hit.

It turned out to be rather complicated. "Sorting for best hit" - could you, please, explain to me, how to do it correctly!

My seqs are from NCBI. Honestly, I don't know how to do this second part: "you could include taxid_map in makeblastdb arguments and make use of that in blastp". It would be extremely useful to me to learn this. Please-please-please, give me more details!

Many-many thanks!

P.S. I'm not sure where I have to place my answer, so I've just repeat it here.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

This sorts tabular blast output (-outfmt 6) for best hit based on 1) bitscore, 2) evalue, 3) % identity:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
sort -k1,1 -k12,12gr -k11,11g -k3,3gr blastOut.tsv | sort -u -k1,1 --merge > bestHits

But before that you might have to exclude self hits, something like:

awk '$1 != $2' blastOut.tsv > blastOut.tsv.noSelfHits

As to taxid_map, this file links all protein GIs to NCBI taxon IDs, so you can use it with the taxid_map argument. But then you also need to use the '-parse_seqids' flag in makeblastdb. Then you also need NCBI tax_db in your db dir. I believe all this is documented and explained further in blast manual. Anyway, then when you have this stuff working you can do e.g. -outfmt '6 std sscinames' to have 13 column output with species name as the last column..

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by 5heikki 11k

0

Entering edit mode

5heikki <http: www.biostars.org="" u="" 9490=""/>

2014-06-14 16:56 GMT+04:00 5heikki on Biostar <notifications@biostars.org>:

ADD REPLY • link 10.9 years ago by natasha.sernova ★ 4.0k