2.3 years ago
Neel
Hi, I have almost 990798 hypothetical proteins across 300 strains. How can I remove the duplicates? I tried sorting, but the count stayed the same. I think that is because the headers differ, so every hypothetical protein entry is treated as unique, even though some proteins must have the same sequence in two different strains.
Thank you!
You will need to show us an example of your input.
Are you interested in removing sequences that have the word "hypothetical" in the header, or in removing sequences that are actually duplicates (irrespective of what the header says)? For the first case, you can use Pierre's FASTA code.
For the latter, you will need to use a program like cd-hit that actually looks at the sequence.
Thank you so much for your reply. I actually want to remove duplicate sequences.
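If only exact duplicates matter (identical sequence, different header), a small script may be enough before reaching for a clustering tool. The following is a minimal sketch, not a tested pipeline: it assumes a plain protein FASTA file, treats sequences as identical after upper-casing and dropping a trailing `*` stop symbol, and keeps the first header seen for each sequence. The function names and the file paths passed to `dedupe_file` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def dedupe_by_sequence(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first record for each distinct sequence, ignoring headers."""
    seen = set()
    unique = []
    for header, seq in records:
        # Assumption: case and a trailing '*' do not make sequences different.
        key = seq.upper().rstrip("*")
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def dedupe_file(in_path: str, out_path: str) -> int:
    """Write the unique records of in_path to out_path; return how many were kept."""
    with open(in_path) as handle:
        unique = dedupe_by_sequence(read_fasta(handle))
    with open(out_path, "w") as out:
        for header, seq in unique:
            out.write(f">{header}\n{seq}\n")
    return len(unique)
```

Calling something like `dedupe_file("hypothetical.faa", "hypothetical.unique.faa")` would write one record per distinct sequence. This only catches exact matches; near-identical proteins that differ by a residue or two would still need a similarity-based tool such as cd-hit.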