Hi everybody,
I ran blastp on the Linux command line (about 7,500 genes) with outfmt 6, then filtered the results (e-value, max_hsps, etc.) to keep just one hit per query. That works very well, but I found that some genes hit the same target, and now I want to know how many genes share each target hit. So what I did was:
cut -f 2 result_blastp_outfmt6.txt > to_find_duplicates.txt
Column 2 holds the target accession number. Then there is my Python script, which finds target names that occur 2 or more times and writes them to duplicates.txt.
It works, but I want to know how many times each duplicate appears in the BLAST result. I'm a noob in Python and I haven't figured it out yet :(
Any advice?
# Find duplicated target names and write them to duplicates.txt
import sys

counts = {}
with open(sys.argv[1], 'r') as infile:
    for elem in infile:
        if elem in counts:
            counts[elem] += 1
        else:
            counts[elem] = 1

# Keep only the targets seen more than once
dups = [x for x, y in counts.items() if y > 1]

with open('duplicates.txt', 'w') as file_out:
    for line in dups:
        file_out.write(line)
What about just using
cut | sort | uniq -d
?

No, uniq -d does the same as my script. Anyway, I didn't know that command, thanks!
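
Since the script already counts every target in a dictionary, the counts only need to be written out next to the names. Here is a minimal sketch of that idea using collections.Counter from the standard library. The function names count_targets and duplicated are hypothetical, and it assumes the input is the one-accession-per-line file made with cut above.

```python
import sys
from collections import Counter


def count_targets(lines):
    """Count how many times each target accession occurs (hypothetical helper)."""
    # Strip the newline so "ABC\n" and a final "ABC" count as the same target;
    # blank lines are skipped.
    return Counter(line.strip() for line in lines if line.strip())


def duplicated(counts, min_count=2):
    """Return (target, count) pairs seen at least min_count times, most frequent first."""
    return [(target, n) for target, n in counts.most_common() if n >= min_count]


# Only runs when a file name is given, e.g. to_find_duplicates.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        counts = count_targets(handle)
    # Write one tab-separated "target<TAB>count" line per duplicated target.
    with open("duplicates.txt", "w") as out:
        for target, n in duplicated(counts):
            out.write(f"{target}\t{n}\n")
```

Saved under any name you like and run with to_find_duplicates.txt as its argument, this should leave one line per duplicated target in duplicates.txt, with the count in the second column. On the shell side, uniq -c (instead of -d) prints every distinct line prefixed with its count, which may be enough if you don't need Python.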