Collecting the 50 most frequent proteins from a tabular blastX result

0

8.0 years ago

Farbod ★ 3.4k

Dear Friends, Hi

I have run a blastX search against the NCBI nr database (using Diamond and keeping -max_target_seqs = 1) with outfmt 6.

I want to collect the 50 proteins that occur most frequently in my results.

Is there a command-line script or program for this task?

(I have tried cutting out the column of IDs, opening it in Microsoft Excel and counting the duplicates, but opening such a large file and running the duplicate count is very difficult on my Windows computer, which is not very powerful.)

Thank you in advance

blast • 1.7k views

ADD COMMENT • link 8.0 years ago by Farbod ★ 3.4k

1

Perhaps this would help (see @Pierre's answer, or the Python scripts if that does not work for you): "Blastp how to find and count duplicates?"

ADD REPLY • link 8.0 years ago by GenoMax 147k

0

Dear genomax2, hi and thank you.

However, I could not work out which Python script in that thread is the final, correct one.

ADD REPLY • link 8.0 years ago by Farbod ★ 3.4k

1

Simple: cut -f 1 blast_out.tbl | sort | uniq -c | sort -k1gr | head -50
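A note on columns: in the default outfmt 6 layout, column 1 is the query ID and column 2 is the subject (hit) ID, so counting proteins would presumably use `cut -f 2` rather than `cut -f 1`. The sketch below assumes that default column order and uses a small made-up file (the IDs and the `demo` variable are hypothetical, not taken from the real results).

```shell
# Build a tiny made-up outfmt 6 style file (tab-separated; only the first
# three columns are filled in here, real files have twelve by default).
demo=$(mktemp)
printf 'q1\tprotA\t90.0\nq2\tprotB\t85.0\nq3\tprotA\t99.0\nq4\tprotC\t70.0\nq5\tprotA\t88.0\nq6\tprotB\t91.0\n' > "$demo"

# Column 2 is assumed to hold the subject (protein) ID.
# sort groups identical IDs, uniq -c counts each group,
# sort -k1,1nr ranks by count (highest first), head keeps the top 50.
top=$(cut -f 2 "$demo" | sort | uniq -c | sort -k1,1nr | head -n 50)
echo "$top"

rm -f "$demo"
```

On the real data, replacing `"$demo"` with `blast_out.tbl` should give each protein ID preceded by the number of queries that hit it, with the most frequent first.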

ADD REPLY • link 8.0 years ago by Asaf 10k

0

Dear Asaf, hi

It seems to be working like magic!

Thank you

ADD REPLY • link 8.0 years ago by Farbod ★ 3.4k

2

No magic, just simple Unix command-line one-liners.

ADD REPLY • link 8.0 years ago by Asaf 10k

0

Dear Asaf,

Your command contains two sort steps. Can it be reduced to just one?

~ Best

ADD REPLY • link 8.0 years ago by Farbod ★ 3.4k

0

Probably not. Start at the left and run the commands, each time adding one more stage of the pipeline, to see why not.

 cut -f 1 blast_out.tbl | less
 cut -f 1 blast_out.tbl | sort | less
 cut -f 1 blast_out.tbl | sort | uniq -c | less

You get the idea.
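To spell out the reason: `uniq -c` only merges identical lines that are adjacent, so the first sort is needed to bring repeats together, and the second sort ranks the resulting counts. If a single sort is preferred, one possible alternative (not from the original command, and using made-up IDs) is to let awk do the counting in an associative array:

```shell
# Made-up list of IDs, deliberately unsorted.
ids=$(printf 'protA\nprotB\nprotA\nprotC\nprotA\nprotB\n')

# Without the first sort, uniq -c only merges adjacent repeats,
# so the same ID is reported on several separate lines.
unsorted_lines=$(echo "$ids" | uniq -c | wc -l)

# awk tallies every ID in an associative array regardless of order,
# leaving a single sort to rank the counts.
ranked=$(echo "$ids" \
  | awk '{count[$1]++} END {for (id in count) print count[id] "\t" id}' \
  | sort -k1,1nr \
  | head -n 50)

echo "lines from unsorted uniq -c: $unsorted_lines"
echo "$ranked"
```

The trade-off is that awk keeps every distinct ID in memory, whereas `sort` can spill to disk, so the two-sort pipeline may still be the safer choice on a very large result file.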

ADD REPLY • link 8.0 years ago by GenoMax 147k
