How to deduplicate files
5.6 years ago

Hi,

I have a list of 160111 protein files. Some of the files are duplicates, because the GCA and GCF IDs contain the same protein sequences. How can I deduplicate the list on the basis of the assembly name (ASM102201v1 in the example below)?

Enterobacter_hormaechei-158836#GCA_001022015.1/GCA_001022015.1_ASM102201v1_protein.faa
Enterobacter_cloacae-550#GCF_001022015.1/GCF_001022015.1_ASM102201v1_protein.faa
5.6 years ago

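Try this. It keeps the first line seen for each assembly name, and it needs GNU awk (gawk), because the three-argument form of match() is a gawk extension: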

$ awk -v FS="#" '{match($2, /(ASM[^_]+)/, asm)} !seen[asm[1]]++' input_file
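
If gawk isn't available, something along these lines might work with any POSIX awk. It's only a sketch: the function name `dedup_by_assembly` is made up here, and it assumes every path ends in a file name of the form `<GCA|GCF>_<accession>_<assembly>_protein.faa`, as in the two example lines from the question.

```shell
# Hypothetical helper: print the first path seen for each assembly name.
# Assumption: the file name looks like GCA_001022015.1_ASM102201v1_protein.faa
dedup_by_assembly() {
    awk '{
        n = split($0, parts, "/")                  # parts[n] is the file name
        name = parts[n]
        sub(/^GC[AF]_[0-9]+\.[0-9]+_/, "", name)   # drop the GCA/GCF accession prefix
        sub(/_protein\.faa$/, "", name)            # drop the suffix, leaving the assembly name
        if (!seen[name]++) print                   # print only the first line per assembly
    }' "$1"
}
```

Call it as `dedup_by_assembly input_file`. Lines that don't follow the assumed naming pattern are keyed on their whole file name, so they are kept rather than collapsed into one.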