Hi all,
I have a data frame in R with a column of gene identifiers (taken from FASTA headers). These vary and include both Ensembl-style (e.g. ENSP000000001) and NCBI-style (e.g. gi|123|ref|XP_000001.1|) identifiers, as well as others.
I want to extract the accession and version numbers from the NCBI identifiers and create a new column as part of my data frame. Non-NCBI identifiers would have an NA in this column.
For example, I would like to change the following data frame:
df1 <- data.frame(gene = c('ENSP00000123', 'gi|1234567|ref|XP_001234.1|',
'gi|1234567|ref|XP_001267.1|', 'ENSP00000124')
)
To this:
                         gene   accession
1                ENSP00000123        <NA>
2 gi|1234567|ref|XP_001234.1| XP_001234.1
3 gi|1234567|ref|XP_001267.1| XP_001267.1
4                ENSP00000124        <NA>
I have tried using regmatches, but this is not working the way I want it to:
df1$accession <- regmatches(df1$gene, regexpr("XP_[0-9]+\\.*[0-9]*", df1$gene))
# results in:
                         gene   accession
1                ENSP00000123 XP_001234.1
2 gi|1234567|ref|XP_001234.1| XP_001267.1
3 gi|1234567|ref|XP_001267.1| XP_001234.1
4                ENSP00000124 XP_001267.1
Any help is greatly appreciated. Thanks in advance.
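For what it's worth, the attempt above seems to go wrong because regmatches() returns only the elements that matched (two values here), and R then recycles that length-2 vector across the four rows. Below is a minimal base R sketch of one way around that, reusing the pattern from the question. The names pattern and m are just illustrative, and stringsAsFactors = FALSE is an assumption added so the column is character on older R versions.

```r
# Example data from the question (kept as character, not factor)
df1 <- data.frame(
  gene = c("ENSP00000123", "gi|1234567|ref|XP_001234.1|",
           "gi|1234567|ref|XP_001267.1|", "ENSP00000124"),
  stringsAsFactors = FALSE
)

# Same pattern as in the question
pattern <- "XP_[0-9]+\\.*[0-9]*"

# regexpr() returns -1 for every element without a match
m <- regexpr(pattern, df1$gene)

# Start with NA everywhere, then fill in only the rows that matched;
# regmatches() returns one value per matching element, in order
df1$accession <- NA_character_
df1$accession[m != -1] <- regmatches(df1$gene, m)

print(df1)
```

If a package dependency is acceptable, stringr::str_extract(df1$gene, pattern) should give the same column in one step, since it returns NA for elements with no match.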
Very impressive. It is much better than my solution. I need to read more about the tidyverse.