9.0 years ago
unique379
Dear all,
I need an explanation of why the grepl function behaves unusually. Or is it just my wrong interpretation?
I have a list of character strings that I need to extract from a matrix/data frame by row.names. To do this I tried a few approaches with subset and the grepl function.
> head(miRNAs) ### list of character
[1] "hsa-mir-200b" "hsa-mir-200a" "hsa-mir-429" "hsa-mir-1256"
[5] "hsa-mir-101-1" "hsa-mir-1262"
> length(miRNAs) ## length
[1] 129
> list=as.character(paste(miRNAs, collapse="|"))
> # with subset and grepl
> extract1=subset(expMatrix, grepl(list, row.names(expMatrix)))
> nrow(extract1)
[1] 150 ## found greater length than actual list, why?
> # without subset
> extract2=expMatrix[grepl(list, row.names(expMatrix)),]
> nrow(extract2)
[1] 150 ## Same here; found greater length than actual list, why?
> # with only subset
> extract3=subset(expMatrix,row.names(expMatrix) %in% miRNAs)
> nrow(extract3)
[1] 129 ## its perfect
> ## without subset and grepl
> extract4=expMatrix[miRNAs, ]
> nrow(extract4) ## its perfect too
[1] 129
So here I have two queries:
- Why is the grepl behavior odd, with or without subset?
- Which trick is suitable to extract a list of character strings from a matrix/data frame: extract3 or extract4?
Thanks
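
On the second query, a small sketch may help show how the extract3 style (`%in%`) and the extract4 style (indexing by name) differ. The matrix, the `wanted` vector and the name hsa-mir-0000 below are made-up toy values, not the data from this post. As far as I can tell, the two approaches return the same rows but can differ in row order and in how they react to a name that is absent from the matrix.

```r
# Toy matrix with hypothetical row names and values (not the poster's data)
expMatrix <- matrix(
  1:8, nrow = 4,
  dimnames = list(
    c("hsa-mir-429", "hsa-mir-200a", "hsa-mir-200b", "hsa-mir-1256"),
    c("s1", "s2")
  )
)
wanted <- c("hsa-mir-200b", "hsa-mir-429")

# extract3 style: logical mask, rows come back in the order of the matrix
by_in <- expMatrix[rownames(expMatrix) %in% wanted, , drop = FALSE]

# extract4 style: index by name, rows come back in the order of 'wanted'
by_name <- expMatrix[wanted, , drop = FALSE]

print(rownames(by_in))
print(rownames(by_name))

# A name that is not in the matrix ("hsa-mir-0000" is invented here):
# %in% silently skips it, while name indexing on a matrix raises an error
with_missing <- c(wanted, "hsa-mir-0000")
missing_ok <- expMatrix[rownames(expMatrix) %in% with_missing, , drop = FALSE]
missing_err <- tryCatch(
  expMatrix[with_missing, , drop = FALSE],
  error = function(e) "error"
)
```

So if every name is known to be present and the order of the list matters, indexing by name seems the natural choice; if some names may be missing, `%in%` is the more forgiving one.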
Indeed, it's not the case of hsa-mir-200 and hsa-mir-200a. The extra rows are as follows:
However, I observed that miRs such as hsa-mir-3172 are in the above list, along with hsa-mir-3164, hsa-mir-3173, etc. If this is the case and the string matches only a few characters rather than the whole word, is there any argument I can enable in grepl, like grep -w in Linux (force PATTERN to match only whole words)?

If hsa-mir-31 (among a couple of others) were in your original list then you could get something like this. There's no point in using grep in R for whole-word matches, which is why the option isn't there (though you can always use ^ and $ to anchor the pattern to the start and end of the string).
Thanks Ryan for your clue... it's done :))
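
For anyone landing here later, the substring behaviour discussed above can be reproduced with a minimal sketch. The names below are toy values, and the presence of hsa-mir-31 in the list is only the assumption suggested in the reply, not something confirmed from the original data. The second pattern wraps the alternation in `^(` and `)$` so that a row name has to equal one of the entries in full.

```r
# Toy values; hsa-mir-31 being in the list is an assumption for illustration
miRNAs <- c("hsa-mir-31", "hsa-mir-200a")
rows <- c("hsa-mir-31", "hsa-mir-3172", "hsa-mir-3164",
          "hsa-mir-200a", "hsa-mir-429")

# Unanchored alternation: a hit anywhere inside the row name counts,
# so "hsa-mir-31" also matches "hsa-mir-3172" and "hsa-mir-3164"
loose <- paste(miRNAs, collapse = "|")
loose_hits <- rows[grepl(loose, rows)]

# Anchored alternation: the whole row name must equal one of the entries
strict <- paste0("^(", paste(miRNAs, collapse = "|"), ")$")
strict_hits <- rows[grepl(strict, rows)]

print(loose_hits)
print(strict_hits)
```

With plain names like these, the anchored pattern should select the same rows as `rows %in% miRNAs`. Names containing regex metacharacters such as `.` would need escaping, which is one more reason `%in%` is the simpler route for exact matches.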