Question

Filtering rows based on specific conditions

7.5 years ago

Promi ▴ 10

Hi,

I have a tab-delimited text file with the sequence IDs in column 1 and the corresponding HMM names in column 7, as shown below.

gi|336321007|ref|YP_004600975.1| adh_short_C2
gi|336321007|ref|YP_004600975.1| adh_short
gi|336321007|ref|YP_004600975.1| KR
gi|557685240|ref|YP_008788710.1| PS-DH
gi|557685240|ref|YP_008788710.1| adh_short_C2
gi|557685240|ref|YP_008788710.1| adh_short
gi|557685240|ref|YP_008788710.1| KR
gi|557685240|ref|YP_008788710.1| ketoacyl-synt
gi|557685240|ref|YP_008788710.1| Ketoacyl-synt_C
.   .
.
.

For every unique sequence ID in column 1, I want to keep its rows only if all of that ID's HMM names are 'adh_short_C2', 'adh_short' or 'KR' (e.g. gi|336321007|ref|YP_004600975.1| in this case).

I want to drop all rows for any ID that has other HMM names in addition to 'adh_short_C2', 'adh_short' or 'KR' (e.g. gi|557685240|ref|YP_008788710.1| in this case).

Desired output: the rows for the IDs that have only 'adh_short_C2', 'adh_short' or 'KR' and no other HMM names.

I tried this code, but it doesn't work well because it also picks up IDs that have other HMM names:

adh_short_C2_list <- subset(adh_short_C2, select=`seq id`)

adh_short_list <- subset(adh_short, select=`seq id`)

How can I apply these two conditions, either together or step by step?

pfam data filtering • 1.6k views

ADD COMMENT • link updated 7.5 years ago by GenoMax 147k • written 7.5 years ago by Promi ▴ 10

data:

                       V1              V2
 gi|336321007|ref|YP_004600975.1    adh_short_C2
 gi|336321007|ref|YP_004600975.1       adh_short
 gi|336321007|ref|YP_004600975.1              KR
 gi|557685240|ref|YP_008788710.1           PS-DH
 gi|557685240|ref|YP_008788710.1    adh_short_C2
 gi|557685240|ref|YP_008788710.1       adh_short
 gi|557685240|ref|YP_008788710.1              KR
 gi|557685240|ref|YP_008788710.1   ketoacyl-synt
 gi|557685240|ref|YP_008788710.1 Ketoacyl-synt_C

Code

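library(dplyr)
# Read the tab-delimited file; with no header row the columns are named V1, V2, ...
data1 <- read.csv("test.txt", sep = "\t", header = FALSE)
View(data1)
# Keep the rows whose HMM name (V2) is one of the listed values
filter(data1, V2 %in% c("KR","adh_short_C2"))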

Result

> filter(data1, V2 %in% c("KR","adh_short_C2"))
                               V1           V2
1 gi|336321007|ref|YP_004600975.1 adh_short_C2
2 gi|336321007|ref|YP_004600975.1           KR
3 gi|557685240|ref|YP_008788710.1 adh_short_C2
4 gi|557685240|ref|YP_008788710.1           KR

ADD REPLY • link 7.5 years ago by cpad0112 21k

The desired output should look like this:

gi|336321007|ref|YP_004600975.1 adh_short_C2
gi|336321007|ref|YP_004600975.1 adh_short
gi|336321007|ref|YP_004600975.1 KR

ADD REPLY • link 7.5 years ago by Promi ▴ 10
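
One possible way to get that output is to filter per ID rather than per row: group the rows by sequence ID and keep a group only when every HMM name in it belongs to the allowed set. The sketch below is a minimal example under some assumptions. It uses the two-column layout (V1 = ID, V2 = HMM name) from the reply above and rebuilds the example rows inline instead of reading the file. The names `allowed` and `result` are made up for illustration. In the original file the HMM name is in column 7, so the column would presumably be V7 rather than V2.

```r
library(dplyr)

# Example rows copied from the thread; V1 = sequence ID, V2 = HMM name.
# (Assumption: a two-column layout; in the real file the HMM name is column 7.)
data1 <- data.frame(
  V1 = c(rep("gi|336321007|ref|YP_004600975.1", 3),
         rep("gi|557685240|ref|YP_008788710.1", 6)),
  V2 = c("adh_short_C2", "adh_short", "KR",
         "PS-DH", "adh_short_C2", "adh_short", "KR",
         "ketoacyl-synt", "Ketoacyl-synt_C"),
  stringsAsFactors = FALSE
)

# HMM names an ID is allowed to have (hypothetical helper vector)
allowed <- c("adh_short_C2", "adh_short", "KR")

# Keep an ID's rows only if every one of its HMM names is in the allowed set
result <- data1 %>%
  group_by(V1) %>%
  filter(all(V2 %in% allowed)) %>%
  ungroup()

print(result)
```

With the example rows, only the three rows for gi|336321007|ref|YP_004600975.1 should remain, because the second ID also has PS-DH, ketoacyl-synt and Ketoacyl-synt_C. The difference from the row-wise `filter(data1, V2 %in% ...)` is the `all()` inside a grouped filter, which tests the whole ID at once instead of each row separately.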