Obtaining the first two words from character strings in a data frame
2.7 years ago
Ne ▴ 10

I would like to extract the first two words from each string in a character column. For example,

y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))

I want to keep at most the first two words of each entry:

name

'london hilss'

'newyork hills'

'paris'

2.7 years ago
Malcolm.Cook ★ 1.5k
gsub('^(\\S*\\s*\\S*).*$','\\1',y$name)
[1] "london hilss"  "newyork hills" "paris"

regular expressions FTW!

edit: used \\S to capture "words" instead of \\w, allowing all non-whitespace characters to be part of "words"
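
For readers less used to regular expressions, here is a commented sketch of the same pattern. It uses sub() rather than gsub(), on the assumption that a single anchored match per string is all that is needed; the variable name first_two is only for illustration.

```r
y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))

# ^        anchor at the start of the string
# (\\S*    first word: zero or more non-whitespace characters
#  \\s*    whitespace between the two words
#  \\S*)   second word, if there is one
# .*$      everything after that, which is dropped
first_two <- sub('^(\\S*\\s*\\S*).*$', '\\1', y$name)
first_two
# [1] "london hilss"  "newyork hills" "paris"
```

Because every piece inside the parentheses may match nothing, a one-word string such as 'paris' passes through unchanged.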

2.7 years ago

Split the character vector on spaces, then take the first two pieces.

> y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))
> library(stringr)
> library(tidyr)
> # trimws() drops the trailing space left when a string has only one word
> trimws(unite(data.frame(str_split(y$name, " ", 3, simplify = TRUE)[, 1:2]), "new", sep = " ")$new)
[1] "london hilss"  "newyork hills" "paris"
2.7 years ago
Julian ▴ 20

Edit: I think I prefer Malcolm's response! It is much shorter and simpler, although maybe less readable.


You can split the strings, as cpad suggested; that's simplest.

Like this:

> firstN <- function(x, n) {
     # Split on spaces and keep at most the first n words
     words <- strsplit(x, " ")[[1]]
     paste(words[1:min(n, length(words))], collapse = " ")
 }

> sapply(y$name, FUN = function(x) firstN(x, 2), USE.NAMES = FALSE)
[1] "london hilss"  "newyork hills" "paris"

I had to make firstN because if you ask for c("Test")[1:2], for example, you'll get an NA.
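
To make the NA behaviour concrete, here is a small base R sketch. It also shows head() as one possible alternative to the min() guard; that swap is a suggestion, not part of the answer above, and the names padded and trimmed are only for illustration.

```r
words <- c("Test")

# Indexing past the end of a vector pads the result with NA
padded <- words[1:2]
padded
# [1] "Test" NA

# head() stops at the end of the vector instead
trimmed <- head(words, 2)
trimmed
# [1] "Test"

paste(trimmed, collapse = " ")
# [1] "Test"
```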


Alternatively you can use the word function from stringr.

Used directly, it works for strings that have at least two words:

> library(stringr)
> y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))
> word(y$name, 1, 2)
[1] "london hilss"  "newyork hills" NA

Unfortunately, it returns NA when a string has only one word.

You can hack together something that fixes that, though, like this:

words_or_fewer <- function(str, n) {
    answer <- word(str, start = 1, end = n)

    # If the answer is NA, retry with fewer words until only one is requested
    while (is.na(answer) && n > 1) {
        n <- n - 1
        answer <- word(str, start = 1, end = n)
    }
    answer
}

# Just a wrapper to use words_or_fewer with vectors
words_or_fewer_vec <- function(str_vec, n) {
    sapply(str_vec, FUN = function(str) words_or_fewer(str, n), USE.NAMES = FALSE)
}
> words_or_fewer_vec(y$name, 2)
[1] "london hilss"  "newyork hills" "paris"
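
A possibly simpler variant, sketched on the assumption that word() accepts a vector for end (its documentation describes start and end as integer vectors): count the words in each string first, then cap the requested number at that count. The helper name words_capped is hypothetical.

```r
library(stringr)

y <- data.frame(name = c('london hilss sff', 'newyork hills fff', 'paris'))

# Hypothetical helper: cap the requested word count at the number of
# words each string actually contains, so word() never runs past the end
words_capped <- function(str_vec, n) {
    n_words <- str_count(str_vec, "\\S+")
    # pmax() keeps the end index at 1 or more, even for empty strings
    word(str_vec, start = 1, end = pmax(1, pmin(n, n_words)))
}

words_capped(y$name, 2)
```

This avoids the loop and the sapply() wrapper, and for the example data it should give the same three strings as words_or_fewer_vec().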