Extracting a specific string pattern from a list of objects
6.1 years ago
dodausp ▴ 190

Hi,

Here is a recurring problem I face from time to time, especially because I would rather have object names that resemble my file names than create a list with different names for those files. It is just easier for me to track down a file when troubleshooting. So, my question is: how can I extract a pattern from my object names rather than a specific string of characters? For example:

vcfs <- list.files()

vcfs
[1] "OV-TCGA-05-1456-01.vcf"   "OV-TCGA-05-4578-01.vcf"   "OV-TCGA-08-5666-01.vcf"   "LUSC-TCGA-10-5684-01.vcf" "LUAD-TCGA-02-6574-01.vcf"

So, as you can see, the first part of each file name defines the cohort type (OV, LUSC, LUAD) and the rest after "TCGA" is unique to each file. I would like to (1) remove the hyphens, (2) keep the cohort name, and (3) keep the 6 digits coming after "TCGA". So the result should look like this:

"OV051456", "OV054578", "OV085666", "LUSC105684", "LUAD026574"

Now, I always struggle using those symbols (*, ., ?, \", "") to extract a string from a character object. So, if any of you could also recommend a good tutorial on those, I would truly appreciate it. And sorry for the simple question. I am not a hardcore bioinformatician. And I love how this community is always so engaging and helpful.

So, thanks a lot in advance!

Cheers,

Douglas

R string subsetting data TCGA SNP • 1.8k views

I also find regex confusing, and often refer to this site to help me:

https://regexr.com/

It has good information, cheatsheets, guides, and a live editor in which you can play around with your expressions.
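
As a quick reference for the symbols mentioned in the question, here is a minimal sketch in base R of what the most common ones do. The vector `x` and the result names are made up for illustration; the second and third entries are not real files.

```r
# Toy names, invented for illustration only
x <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-1456-01_vcf", "notes.txt")

# "." matches ANY single character, so an unescaped dot also matches "_"
dot_any <- grepl("01.vcf$", x)          # TRUE TRUE FALSE

# "\\." matches a literal dot (the backslash is doubled inside an R string)
dot_literal <- grepl("01\\.vcf$", x)    # TRUE FALSE FALSE

# "^" anchors the pattern to the start of the string, "$" to the end
starts_ov <- grepl("^OV", x)            # TRUE TRUE FALSE

# "[0-9]" is one digit; "{4}" repeats the previous item exactly four times
four_digits <- grepl("-[0-9]{4}-", x)   # TRUE TRUE FALSE

# "+" means one or more, "*" zero or more, "?" zero or one of the previous item
cohort <- grepl("^[A-Z]+-TCGA", x)      # TRUE TRUE FALSE
```

The first two calls show why an unescaped dot in a file extension can match more than intended.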

6.1 years ago
Russ ▴ 520

There's probably a nifty regex one-liner that will accomplish the task more efficiently, but my strategy is simple and works:

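# drop the "-TCGA-" separator
vcf1 <- gsub("-TCGA-", "", vcfs)
# drop the trailing two-digit code and the ".vcf" extension (dot escaped, anchored to the end)
vcf2 <- gsub("-[0-9][0-9]\\.vcf$", "", vcf1)
# drop the remaining hyphen
vcf3 <- gsub("-", "", vcf2)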

> vcf3
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
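
For comparison, the three steps above can probably be collapsed into a single `sub()` call with capture groups. This is only a sketch: it assumes every file name follows the cohort-TCGA-2 digits-4 digits-2 digits layout shown in the question, and the names `pattern` and `ids` are made up here.

```r
# File names copied from the question
vcfs <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-4578-01.vcf",
          "OV-TCGA-08-5666-01.vcf", "LUSC-TCGA-10-5684-01.vcf",
          "LUAD-TCGA-02-6574-01.vcf")

# ^([A-Za-z]+)      group 1: cohort name (letters only)
# -TCGA-            literal separator, not captured
# ([0-9]{2})        group 2: two digits
# -([0-9]{4})       group 3: four digits (the hyphen before them is not captured)
# -[0-9]{2}\\.vcf$  trailing two-digit code and extension, not captured
pattern <- "^([A-Za-z]+)-TCGA-([0-9]{2})-([0-9]{4})-[0-9]{2}\\.vcf$"

# "\\1\\2\\3" pastes the three captured groups back together
ids <- sub(pattern, "\\1\\2\\3", vcfs)
ids
# [1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
```

Note that `sub()` returns a name unchanged when the pattern does not match, so unexpected file names pass through untouched rather than raising an error.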

That was really helpful! It worked just fine for all the files I had, with no errors. Also, I liked the way you laid it out. It made it very easy for me to understand each step of the routine.

Thanks a lot, Russ!

6.1 years ago
jweile ▴ 20

Having run into this kind of problem several times as well, I have since written a little helper function for these types of situations:

#' Extract regex groups (local)
#' 
#' Locally excise regular expression groups from string vectors.
#' I.e. only extract the first occurrence of each group within each string.
#' 
#' @param x A vector of strings from which to extract the groups.
#' @param re The regular expression defining the groups
#' @return A \code{matrix} containing the group contents, 
#'      with one row for each element of x and one column for each group.
#' @keywords regular expression groups
#' @export
extract.groups <- function(x, re) {
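    # perl=TRUE is needed for the capture.start/capture.length attributes
    matches <- regexpr(re, x, perl=TRUE)
    start <- attr(matches, "capture.start")
    end <- start + attr(matches, "capture.length") - 1
    # one column per group, one row per element of x; NA where nothing matched
    do.call(cbind, lapply(seq_len(ncol(start)), function(i) {
        sapply(seq_len(nrow(start)), function(j) {
            if (start[j,i] > -1) substr(x[[j]], start[j,i], end[j,i]) else NA_character_
        })
    }))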
}

For your specific problem, it can be used as follows:

> groups <- extract.groups(vcfs,"^(\\w+)-TCGA-(\\d{2})-(\\d{4})-01\\.vcf$")
> groups
     [,1]   [,2] [,3]  
[1,] "OV"   "05" "1456"
[2,] "OV"   "05" "4578"
[3,] "OV"   "08" "5666"
[4,] "LUSC" "10" "5684"
[5,] "LUAD" "02" "6574"
> output <- apply(groups,1,paste,collapse="")
> output
[1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
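
A similar group matrix can likely be built without a helper, using base R's `regexec()` and `regmatches()`. The sketch below assumes the same file-name layout as in the question; `pattern`, `hits`, `groups`, and `ids` are illustrative names, not part of the function above.

```r
# File names copied from the question
vcfs <- c("OV-TCGA-05-1456-01.vcf", "OV-TCGA-05-4578-01.vcf",
          "OV-TCGA-08-5666-01.vcf", "LUSC-TCGA-10-5684-01.vcf",
          "LUAD-TCGA-02-6574-01.vcf")

pattern <- "^([A-Za-z]+)-TCGA-([0-9]{2})-([0-9]{4})-01\\.vcf$"

# regexec() records the position of the full match and of each group;
# regmatches() turns those positions into a list of character vectors
hits <- regmatches(vcfs, regexec(pattern, vcfs))

# Drop the full match (first element) and keep the three groups;
# names that do not match yield a row of NA
groups <- t(vapply(hits, function(h) {
    if (length(h) > 0) h[-1] else rep(NA_character_, 3)
}, character(3)))

# Paste the groups of each row together, keeping NA for non-matching names
ids <- paste0(groups[, 1], groups[, 2], groups[, 3])
ids[is.na(groups[, 1])] <- NA_character_
ids
# [1] "OV051456"   "OV054578"   "OV085666"   "LUSC105684" "LUAD026574"
```

Keeping the groups in a matrix is handy when the cohort and the numeric parts are also needed separately, for example to split the files by cohort.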

I tried your routine and it worked nicely as well. It is also really nice that you went all the way to explain what the function does. And I was glad to know that I am not alone in this "challenging" issue.

This community is ace! Thanks a lot, jweile!
