Question

Unusual behavior of grepl in R.

9.7 years ago

unique379 ▴ 120

Dear all,

I need an explanation of why the grepl function behaves unusually. Or is it just my misinterpretation?

I have a list of character strings that I need to extract from a matrix/data frame by row.names. To do this, I tried a few approaches using subset and grepl.

> head(miRNAs) ### list of character
[1] "hsa-mir-200b"  "hsa-mir-200a"  "hsa-mir-429"   "hsa-mir-1256" 
[5] "hsa-mir-101-1" "hsa-mir-1262"
> length(miRNAs) ## length
[1] 129
> list=as.character(paste(miRNAs, collapse="|"))
> # with subset and grepl
> extract1=subset(expMatrix, grepl(list, row.names(expMatrix)))
> nrow(extract1)
[1] 150   ## found greater length than actual list, why?
> # without subset
> extract2=expMatrix[grepl(list, row.names(expMatrix)),]
> nrow(extract2)
[1] 150 ## Same here; found greater length than actual list, why?
> # with only subset
> extract3=subset(expMatrix,row.names(expMatrix) %in% miRNAs)
> nrow(extract3)
[1] 129 ## its perfect
> ## without subset and grepl
> extract4=expMatrix[miRNAs, ]
> nrow(extract4) ## its perfect too
[1] 129

So here I have two queries:

Why does grepl behave oddly, with or without subset?
Which approach is suitable for extracting a list of character strings from a matrix/data frame: extract3 or extract4?

Thanks

R • 2.5k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.7 years ago by unique379 ▴ 120

Ram · Answer 1 · 2015-11-10

9.7 years ago

Devon Ryan 105k

It's quite likely that a name in your list, something like hsa-mir-200, is a substring of longer row names such as hsa-mir-200a and hsa-mir-200b. grepl matches substrings, so it returns those rows as well, whereas %in% and direct subsetting only return exact matches.
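
A minimal sketch of the difference between substring and exact matching. The row names and search list below are made up for illustration and are not the poster's data:

```r
# Hypothetical row names and search list, for illustration only
row_ids <- c("hsa-mir-200", "hsa-mir-200a", "hsa-mir-200b", "hsa-mir-429")
wanted  <- c("hsa-mir-200", "hsa-mir-429")

# Same construction as in the question: one alternation pattern
pattern <- paste(wanted, collapse = "|")

# grepl: TRUE wherever any wanted name occurs as a substring
partial <- row_ids[grepl(pattern, row_ids)]

# %in%: TRUE only for exact, whole-string matches
exact <- row_ids[row_ids %in% wanted]

print(partial)  # all four row names
print(exact)    # only "hsa-mir-200" and "hsa-mir-429"
```

Here "hsa-mir-200" also matches inside "hsa-mir-200a" and "hsa-mir-200b", which is how the grepl result can end up longer than the search list.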

Indeed, it's not the case of hsa-mir-200 and hsa-mir-200a. The extra rows are as follows:

21 miRNAs found only in the set of 150:
hsa-mir-3121
hsa-mir-1278
hsa-mir-936
hsa-mir-3163
hsa-mir-3164
hsa-mir-3166
hsa-mir-3173
hsa-mir-3174
hsa-mir-3176
hsa-mir-3177
hsa-mir-3178
hsa-mir-3182
hsa-mir-3187
hsa-mir-1270-1
hsa-mir-1270-2
hsa-mir-3136
hsa-mir-3138
hsa-mir-1271
hsa-mir-1275
hsa-mir-939
hsa-mir-500b ## it could be the same case

However, I noticed that miRNAs similar to hsa-mir-3172, such as hsa-mir-3164 and hsa-mir-3173, are in the list above. If that is the cause, and the pattern matches only part of the string rather than the whole word, is there an argument I can pass to grepl to prevent it, like grep -w on Linux (force PATTERN to match only whole words)?

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by unique379 ▴ 120

If hsa-mir-31 (among a couple of others) were in your original list, then you could get something like this, since it is a substring of names such as hsa-mir-3121 and hsa-mir-3163. There's no point in using grep in R for whole-word matches, which is why the option isn't there (though you can always use ^ and $ to anchor the pattern to the start and end of the string).
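
A minimal sketch of the anchoring approach, again with made-up names rather than the poster's data. One assumption worth stating: because the pattern is an alternation, the alternatives are wrapped in parentheses so that ^ and $ apply to every name and not just the first and last one.

```r
# Hypothetical data, for illustration only
row_ids <- c("hsa-mir-31", "hsa-mir-3121", "hsa-mir-3163", "hsa-mir-936")
wanted  <- c("hsa-mir-31", "hsa-mir-936")

# Unanchored alternation: matches anywhere inside a row name
loose <- paste(wanted, collapse = "|")

# Anchored alternation: the group makes ^ and $ apply to each alternative
strict <- paste0("^(", paste(wanted, collapse = "|"), ")$")

loose_hits  <- row_ids[grepl(loose, row_ids)]
strict_hits <- row_ids[grepl(strict, row_ids)]

print(loose_hits)   # includes hsa-mir-3121 and hsa-mir-3163
print(strict_hits)  # only the exact names

# The anchored result agrees with plain exact matching
print(identical(strict_hits, row_ids[row_ids %in% wanted]))
```

The names here contain only letters, digits and hyphens, so they are safe to paste into a regular expression unescaped; names with characters such as . or + would need escaping first. For exact matches, %in% or direct indexing by name gives the same rows with less work.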