8.0 years ago
Farbod
Dear Friends, Hi
I have run a BLASTX search against the NCBI nr database (using DIAMOND and keeping -max_target_seqs = 1) with outfmt 6.
I want to collect the 50 proteins that occur most frequently in my results.
Is there any command-line script or program for doing this task?
(I have tried cutting out the column of IDs, opening it in Microsoft Excel and counting the duplicates, but opening such a file and running the duplicate count is very difficult on my Windows computer, which is not very powerful.)
Thank you in advance
Perhaps this would help (see @Pierre's answer, or the Python scripts if that does not help): Blastp how to find and count duplicates?
Dear genomax2, Hi & thank you.
But I could not work out which one is the final, correct Python script.
Simple: cut -f 1 blast_out.tbl | sort | uniq -c | sort -k1gr | head -50
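For anyone who wants to try the pipeline on something small first, here is a minimal sketch on a made-up table. The IDs and the temporary file are hypothetical. One thing to check against your own file: in the default outfmt 6 layout, column 1 is the query ID and column 2 is the subject (hit) ID, so counting the most frequent proteins probably means cutting column 2 rather than column 1. The sketch assumes that layout.

```shell
# Hypothetical toy input: tab-separated, query ID in column 1,
# subject (protein) ID in column 2, as in default outfmt 6.
tmp=$(mktemp)
printf 'q1\tprotA\nq2\tprotB\nq3\tprotA\nq4\tprotC\nq5\tprotA\nq6\tprotB\n' > "$tmp"

# Extract the subject IDs, group identical IDs together, count each group,
# rank by count (highest first) and keep the top 2.
# The final awk step only reorders the output to "ID<TAB>count".
top=$(cut -f 2 "$tmp" | sort | uniq -c | sort -k1,1nr | head -n 2 | awk '{print $2 "\t" $1}')
printf '%s\n' "$top"

rm -f "$tmp"
```

On a real result file, replace the toy input with your own table and change `head -n 2` to `head -n 50`.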
Dear Asef, Hi
It seems that it is magically working!
Thank you
No magic, just simple Unix one-liners.
Dear Asef,
it seems that your script has two sort commands in it. Can we reduce it to just one? Best
Probably not. You can start at the left and run the commands one at a time, each time adding one more stage from the pipeline, to see why not.
You get the idea.
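To spell out that exercise, here is a small sketch with a hypothetical ID list. uniq -c only merges adjacent identical lines, so the first sort is what brings duplicates together, and the second sort is what ranks the counts. The awk variant at the end is an alternative, not part of the original answer: it replaces the first sort and uniq -c with a single counting pass, but the ranking sort is still needed.

```shell
# Hypothetical unsorted ID list with repeated but non-adjacent entries.
ids=$(printf 'protA\nprotB\nprotA\nprotB\nprotA\n')

# Without the first sort, uniq -c sees no adjacent duplicates,
# so every line is reported separately with a count of 1.
unsorted_lines=$(printf '%s\n' "$ids" | uniq -c | wc -l | tr -d ' ')

# With the first sort, identical IDs become adjacent and collapse
# into one line per distinct ID.
sorted_lines=$(printf '%s\n' "$ids" | sort | uniq -c | wc -l | tr -d ' ')

# awk can count in a single pass (no pre-sorting), but its output order
# is unspecified, so a sort by count is still required to find the top hit.
best=$(printf '%s\n' "$ids" \
  | awk '{n[$1]++} END {for (id in n) print n[id] "\t" id}' \
  | sort -k1,1nr | head -n 1)

echo "$unsorted_lines $sorted_lines $best"
```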