Hi,
I have a tab-delimited text file with the sequence IDs in column 1 and the corresponding HMM names in column 7, as shown below.
gi|336321007|ref|YP_004600975.1| adh_short_C2
gi|336321007|ref|YP_004600975.1| adh_short
gi|336321007|ref|YP_004600975.1| KR
gi|557685240|ref|YP_008788710.1| PS-DH
gi|557685240|ref|YP_008788710.1| adh_short_C2
gi|557685240|ref|YP_008788710.1| adh_short
gi|557685240|ref|YP_008788710.1| KR
gi|557685240|ref|YP_008788710.1| ketoacyl-synt
gi|557685240|ref|YP_008788710.1| Ketoacyl-synt_C
...
For every unique sequence ID in column 1, I want to keep all the rows when the ID's HMM names are only 'adh_short_C2', 'adh_short' or 'KR', e.g. gi|336321007|ref|YP_004600975.1| in this case.
I want to delete all the rows for any ID that has other HMM names in addition to 'adh_short_C2', 'adh_short' or 'KR', e.g. gi|557685240|ref|YP_008788710.1| in this case.
Desired output: the rows for the IDs that have only 'adh_short_C2', 'adh_short' or 'KR' and no other HMM names.
I tried this code, but it doesn't work well because it also picks up the IDs that have other HMM names:
adh_short_C2_list <- subset(adh_short_C2, select=`seq id`)
adh_short_list <- subset(adh_short, select=`seq id`)
How can I apply these two conditions together, or step by step?
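To make the two conditions concrete, here is a minimal sketch in base R of one way they could be combined: first collect the IDs that have any HMM name outside the allowed set, then drop every row belonging to those IDs. The data frame `hits` and the column names `seq_id` and `hmm` are assumptions for illustration only; they stand in for columns 1 and 7 of the real file.

```r
# Toy data copied from the example rows above.
# 'hits', 'seq_id' and 'hmm' are hypothetical names; for the real file the two
# columns could come from something like read.delim(<file>, header = FALSE)[, c(1, 7)].
hits <- data.frame(
  seq_id = c(rep("gi|336321007|ref|YP_004600975.1|", 3),
             rep("gi|557685240|ref|YP_008788710.1|", 6)),
  hmm = c("adh_short_C2", "adh_short", "KR",
          "PS-DH", "adh_short_C2", "adh_short", "KR",
          "ketoacyl-synt", "Ketoacyl-synt_C"),
  stringsAsFactors = FALSE
)

# HMM names that an ID is allowed to have
allowed <- c("adh_short_C2", "adh_short", "KR")

# Step 1: IDs that have at least one HMM name outside the allowed set
bad_ids <- unique(hits$seq_id[!(hits$hmm %in% allowed)])

# Step 2: keep only the rows whose ID is not in that list
result <- hits[!(hits$seq_id %in% bad_ids), ]

print(result)
```

On this toy data, `result` holds the three rows for gi|336321007|ref|YP_004600975.1| and none for gi|557685240|ref|YP_008788710.1|. Filtering by ID rather than by row is what keeps an ID from slipping through just because some of its rows carry an allowed name.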
The desired output should be like:
gi|336321007|ref|YP_004600975.1 adh_short_C2
gi|336321007|ref|YP_004600975.1 adh_short
gi|336321007|ref|YP_004600975.1 KR