Question

Blastp how to find and count duplicates?..

2

Entering edit mode

9.0 years ago

Buffo ★ 2.4k

Hi everybody, I did a blastp on linux command line (about 7,500 genes), outfmt6, I did a filter (e-value, max_hsps, etc), just one hit per query, and it works very well, but I found that some genes have hit with the same target and now I want to now how many genes have the same target hit, so what I did was; cut -f 2 result_blastp_outfmt6.txt > to_find_duplicates.txt, column 2 have the target accession number and then, this is my python script: It Find target names duplicates (2 or more times) and print them. It works, but I want to know how many times is present each duplicate on blast result, I'm a noob in python and i don't get it until now :(

some advices??!

#find duplicates and then write them in duplicates.txt file
file = open(sys.argv[1],'r')
list = {}
for elem in file:
    if elem in list:
        list[elem] += 1
    else:
        list[elem] = 1
dups = [x for x, y in list.items() if y > 1]

file_out = open('duplicates.txt', 'w')
for line in dups:
    file_out.write(line)
file_out.close()

blast gene outfmt6 python script • 2.9k views

ADD COMMENT • link updated 9.0 years ago by piet ★ 1.9k • written 9.0 years ago by Buffo ★ 2.4k

2

Entering edit mode

what about just using cut | sort | uniq -d ?

ADD REPLY • link 9.0 years ago by Pierre Lindenbaum 166k

2

Entering edit mode

No, uniq -d does the same that my script, anyways I didn´t know that command, thank`s!

ADD REPLY • link 9.0 years ago by Buffo ★ 2.4k

score 2 · Answer 1 · 2016-07-23

2

Entering edit mode

9.0 years ago

piet ★ 1.9k

I want to know how many times is present each duplicate on blast result

your python script needs only minor changes to accomplish that.

for line, count in list.iteritems() if count > 1:
    sys.stdout.write("%i\t%s" % (count,line,))

In this case you want to "aggregate" (collect) data across several lines of the blast output. Problems like this (sorting and aggregating) can be solved with python efficiently. I would recommend the "Python Cookbook" by Alex Martelli to learn more about how to write short and efficient python code for string processing and sorting.

Coding style: 'list' and 'file' are names build-in into python. You should never use build-in names for your own variables.

I would also recommend to read the blast output directly with python:

acc_count = {}
for line in open("result_blastp_outfmt6.txt"):
    fields = line.split("\t")
    acc = fields[1] # zero-based indexing
    acc_count[acc] = acc_count.get(acc,0) + 1 # this is the TYPICAL pythonic idiom for counting

ADD COMMENT • link 9.0 years ago by piet ★ 1.9k

1

Entering edit mode

Thanks piet, I tried to add your suggestion but it doesn`t works:

for line, count in list.iteritems() if count > 1:
SyntaxError: invalid syntax

I added that after

dups = [x for x, y in list.items() if y > 1

That`s right ? Or it was instead of?

Thanks piet!

ADD REPLY • link 9.0 years ago by Buffo ★ 2.4k

2

Entering edit mode

Yes, I see. The syntax is wrong. The if clause has to be moved into the for loop. Try out the variants, this will help you understand programming.

ADD REPLY • link 9.0 years ago by piet ★ 1.9k

0

Entering edit mode

for line, count in list.iteritems():
    if count > 1:
        sys.stdout.write("%i\t%s" % (count,line,))

I love it, it works perfect! I will try also read directly the blast output! thank you so much piet!

ADD REPLY • link 9.0 years ago by Buffo ★ 2.4k