Question

Populate data frame column X with substring from column Y using R

8.2 years ago

ddowlin ▴ 70

Hi all,

I have a data frame in R with a column of gene identifiers (taken from FASTA headers). These vary and include both Ensembl-style (e.g. ENSP000000001) and NCBI-style (e.g. gi|123|ref|XP_000001.1|) identifiers, as well as others.

I want to extract the accession and version numbers from the NCBI identifiers and create a new column as part of my data frame. Non-NCBI identifiers would have an NA in this column.

For example, I would like to change the following data frame:

df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|',
                           'gi|1234567|ref|XP_001267.1|', 'ENSP00000124')
                  )

To this:

                             gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>

I have tried using regmatches, but it does not work the way I want: the extracted accessions do not line up with the rows they came from.

df1$accession <- regmatches(df1$gene, regexpr("XP_[0-9]+\\.*[0-9]*", df1$gene))

# results in:

                         gene   accession
1                ENSP00000123 XP_001234.1
2 gi|1234567|ref|XP_001234.1| XP_001267.1
3 gi|1234567|ref|XP_001267.1| XP_001234.1
4                ENSP00000124 XP_001267.1

Any help is greatly appreciated. Thanks in advance.

R • 9.5k views

ADD COMMENT • link updated 8.2 years ago by cpad0112 21k • written 8.2 years ago by ddowlin ▴ 70

score 2 · Answer 1 · 2017-08-23

8.2 years ago

ddowlin ▴ 70

Well, I quickly found a solution using stringr and dplyr here.

library(stringr)
library(dplyr)

df1 <- df1 %>%
  mutate(accession = str_extract(gene, "XP_[0-9]+\\.*[0-9]*"))

gives:

                              gene         accession
1                     ENSP00000123           <NA>
2      gi|1234567|ref|XP_001234.1|    XP_001234.1
3      gi|1234567|ref|XP_001267.1|    XP_001267.1
4                     ENSP00000124           <NA>
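
For comparison, here is a base R sketch of the same idea. `regmatches()` returns only the substrings that matched, so the original attempt produced two values that were recycled across four rows. Assigning into the positions where `regexpr()` found a match keeps each accession on its own row. The `hit` variable and the stricter `\\.[0-9]+` version part are choices made for this sketch, not part of the original code.

```r
# Example data from the question
df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|',
                           'gi|1234567|ref|XP_001267.1|', 'ENSP00000124'),
                  stringsAsFactors = FALSE)

# regexpr() returns -1 for elements without a match
hit <- regexpr("XP_[0-9]+\\.[0-9]+", df1$gene)

# Start with NA everywhere, then fill only the rows that matched;
# regmatches() yields one value per matching element, in order
df1$accession <- NA_character_
df1$accession[hit != -1] <- regmatches(df1$gene, hit)

df1
```

This avoids the stringr and dplyr dependencies at the cost of a couple of extra lines.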

ADD COMMENT • link 8.2 years ago by ddowlin ▴ 70

Very impressive. It is much better than my solution. I have to read more about the tidyverse.

ADD REPLY • link 8.2 years ago by e.rempel ★ 1.1k

score 1 · Answer 2 · 2017-08-23

This is really a question about the R language, so Stack Overflow may be a better place for it. Here is my attempt to solve it, assuming that the NCBI identifier is always in the same position (in this case the 4th, counting | as the separator):

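position_id <- 4
df1$accession <- NA
# Rows whose identifier contains an NCBI protein accession
ncbi_rows <- grep(pattern = "XP_", x = df1$gene)
# Split those identifiers on "|" and keep the field at position_id
df1$accession[ncbi_rows] <- limma::strsplit2(x = df1$gene[ncbi_rows], split = "\\|")[, position_id]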
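
The same position-based approach can be sketched without limma, using base R's `strsplit()`. This is only an illustration under the same assumption (the accession is always the 4th `|`-separated field); the `parts` variable and the `startsWith()` check are additions for this sketch.

```r
# Example data from the question
df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|',
                           'gi|1234567|ref|XP_001267.1|', 'ENSP00000124'),
                  stringsAsFactors = FALSE)

position_id <- 4

# Split every identifier on a literal "|"; identifiers without "|" yield one piece
parts <- strsplit(df1$gene, "|", fixed = TRUE)

# Take the field at position_id when it exists and looks like an XP_ accession
df1$accession <- vapply(parts, function(p) {
  if (length(p) >= position_id && startsWith(p[position_id], "XP_")) {
    p[position_id]
  } else {
    NA_character_
  }
}, character(1))

df1
```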

score 0 · Answer 3 · 2017-08-23

8.2 years ago

cpad0112 21k

library(stringr)
df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|','gi|1234567|ref|XP_001267.1|', 'ENSP00000124'), stringsAsFactors = FALSE)
# Set identifiers without "xp" (case-insensitive) to NA, then extract accession.version
df1$accession <- str_extract(ifelse(grepl("xp", df1$gene, ignore.case = TRUE), df1$gene, NA), "XP_[0-9]+\\.[0-9]+")

output:

> df1
                         gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>
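
The version part of the pattern is worth a quick check: an unescaped `.` matches any character, and a single `[0-9]` captures only one version digit. The sketch below compares the two forms in base R. The identifiers in `ids` are made up for illustration, and `loose` and `strict` are names chosen here.

```r
# Hypothetical identifiers: a two-digit version, and an accession with no version
ids <- c("gi|1|ref|XP_001234.12|", "gi|2|ref|XP_0012345|")

loose  <- "XP_[0-9]+.[0-9]"     # "." matches any character, one trailing digit
strict <- "XP_[0-9]+\\.[0-9]+"  # literal dot, one or more version digits

# Which identifiers does each pattern accept?
loose_hits  <- grepl(loose, ids)
strict_hits <- grepl(strict, ids)

# What is extracted from the two-digit-version identifier?
loose_match  <- regmatches(ids[1], regexpr(loose, ids[1]))
strict_match <- regmatches(ids[1], regexpr(strict, ids[1]))

print(loose_hits)    # TRUE TRUE: the versionless accession also matches
print(strict_hits)   # TRUE FALSE
print(loose_match)   # "XP_001234.1": the second version digit is cut off
print(strict_match)  # "XP_001234.12"
```

On the example data from the question both patterns give the same result, since every version there is a single digit.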

ADD COMMENT • link 8.2 years ago by cpad0112 21k

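library(tidyr)
df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|','gi|1234567|ref|XP_001267.1|', 'ENSP00000124'), stringsAsFactors = FALSE)
# Split on "|" into four columns (names are arbitrary); pad short rows with NA, drop the trailing empty field
df1$accession <- separate(df1, gene, into = c("db", "gi", "ref", "accession"), sep = "\\|", fill = "right", extra = "drop")$accession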

output:

> df1
                         gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>

ADD REPLY • link 8.2 years ago by cpad0112 21k