How to discard identical hypothetical proteins from a protein file covering 300 strains


3.1 years ago

Neel ▴ 20

Hi, I have almost 990798 hypothetical proteins across 300 strains. How can I remove the duplicates from them? I tried sorting, but the count stayed the same. I think that is because the headers differ, so every hypothetical protein is treated as unique, even though some proteins must have the same sequence in two different strains.


Thank you!

annotation fasta • 934 views


You have to show us an example of the input.

3.1 years ago by Pierre Lindenbaum 166k

Are you interested in removing sequences that have the word "hypothetical" in the header, or sequences that are actually duplicates (irrespective of what the header says)?

For the first case, you can do this (Pierre's FASTA code):

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' your.fa | grep -v "hypothetical" | tr "\t" "\n" | fold -w 80 > clean.fa

For the latter, you will need to use a program like cd-hit that actually looks at the sequence.

3.1 years ago by GenoMax 153k

Thank you so much for your reply. I actually want to remove duplicate sequences.

3.1 years ago by Neel ▴ 20
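
If only exact duplicates matter (the same sequence under different headers), the linearize idea from the awk one-liner above can be extended to compare sequences instead of headers. The following is a minimal sketch, not a tested pipeline for this dataset: `dedup_fasta` is a hypothetical helper name, the file names in the comment are placeholders, and it assumes the whole set of unique sequences fits in memory. It keeps the first header seen for each distinct sequence and writes each sequence on a single line.

```shell
# Sketch: keep only the first record for each distinct sequence in a FASTA file.
# dedup_fasta is a hypothetical helper name; the FASTA path is passed as $1.
# Example (placeholder file names): dedup_fasta all_strains.faa > unique.faa
dedup_fasta() {
  awk '
    # Print the buffered record if its sequence has not been seen before.
    function emit() {
      if (header != "" && !(seq in seen)) {
        seen[seq] = 1
        print header
        print seq
      }
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { emit(); header = $0; seq = ""; next }
    # Sequence lines: strip a trailing carriage return, join wrapped lines,
    # and uppercase so that case differences do not hide duplicates.
    { sub(/\r$/, ""); seq = seq toupper($0) }
    END { emit() }
  ' "$1"
}
```

This only collapses sequences that are identical over their full length. If near-identical sequences or fragments should also be merged, a clustering tool such as cd-hit with a chosen identity threshold (its `-c` option) is the more appropriate route.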
