Question

Any way to identify list of bed file with complicated pattern?

0

Entering edit mode

8.7 years ago

Jurat Shahidin ▴ 100

I have list of bed files that needed to used in my devel package as an external data. Because these bed files has rather complicated pattern, and I intend to get pair of bed files with same pattern for my packages use (i.e., pair of bed file with same pattern could be two or three in my data set, please look at my example). However, I gave my shot to make this happen, but my solution is not efficient and I got error instead. Can anyone point me out how to make this happen easily? How to efficiently manipulate file names with complicated pattern in R ? How to achieve my desired function output ? Any quick solution to solve this issue? Thanks a lot :)

Here is my external data set in my packages, list of bed files in extdata :

myPkg
    - inst
         - extdata
             - wgEncodeOpenChromChipK562CmycAlnRep1.bed
             - wgEncodeOpenChromChipK562CmycAlnRep2.bed
             - wgEncodeOpenChromChipK562CmycAlnRep3.bed
             - wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycIfna30StdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIfna30StdAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycIggrabAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycIggrabAlnRep2.bed
             - wgEncodeSydhTfbsK562CmycStdAlnRep1.bed
             - wgEncodeSydhTfbsK562CmycStdAlnRep2.bed
    - R

Here is part of my solution :

getExtDat <- function() {
  dir <- system.file("extdata", package="myPkg")
  files <- list.files(dir)
  # FIXME : I got error, make it more dynamic
  #patt <- gsub(pattern='wgE\\d+_(\\w+_\\w+)_.*', replacement='\\1',files)
  res <- paste(dir, files, sep="/")
  res <- as.list(res)
  return(res)
}

print(getExtDat)

And getExtdat object passed to this function :

myFunc <- function(getExtDat) {
  files <- list.files(getExtDat, full.names = TRUE, "\\.bed$")
  readMe <- lapply(files, import.bed)
  return(readMe)
}

Edit example :

let getExtDat find pair of bed files with same pattern (

for example, I need these bed files as an input for myFunc :

wgEncodeOpenChromChipK562CmycAlnRep1.bed, 
wgEncodeOpenChromChipK562CmycAlnRep2.bed, wgEncodeOpenChromChipK562CmycAlnRep3.bed

or I need these bed files as an input for myFunc :

wgEncodeSydhTfbsK562CmycIggrabAlnRep1.bed
wgEncodeSydhTfbsK562CmycIggrabAlnRep2.bed

or I need these bed files as an input for myFunc:

wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep1.bed
wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep2.bed

are my objective to find by getExtDat, instead of loading all bed files ), and pass getExtDat to myFunc.

How to make getExtDat function to find desired paired bed files available for myFunc as input ? How to correctly identify desired pattern of bed files more dynamically ? Any way to more programmatic solution for finding desired bed files with complicated pattern ? Any idea please ?

ChIP-Seq r bed sequencing • 2.6k views

ADD COMMENT • link 8.7 years ago by Jurat Shahidin ▴ 100

2

Entering edit mode

I do not know much R, but what I would do is something like,

li <- c("wgEncodeOpenChromChipK562CmycAlnRep1.bed", "wgEncodeOpenChromChipK562CmycAlnRep2.bed", "wgEncodeOpenChromChipK562CmycAlnRep3.bed", "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep1.bed", "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep2.bed")

replicates <- unique(gsub("wgEncode|Rep[0-9].bed", "", li))

replicates
[1] "OpenChromChipK562CmycAln"     "SydhTfbsK562CmycIfna6hStdAln"

#Then a for loop for each replicate,

grep(replicates[1], li, value=TRUE)
[1] "wgEncodeOpenChromChipK562CmycAlnRep1.bed"
[2] "wgEncodeOpenChromChipK562CmycAlnRep2.bed"
[3] "wgEncodeOpenChromChipK562CmycAlnRep3.bed"

grep(replicates[2], li, value=TRUE)
[1] "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep1.bed"
[2] "wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep2.bed"

ADD REPLY • link 8.7 years ago by GouthamAtla 12k

1

Entering edit mode

@Goutham Atla: That's what I was also thinking... (I think there shouldn't be spaces around '|' in the gsub pattern)

ADD REPLY • link 8.7 years ago by dariober 15k

0

Entering edit mode

Thanks, edited it.

ADD REPLY • link 8.7 years ago by GouthamAtla 12k

0

Entering edit mode

@Goutham Atla : I edited my example as well, Is that possible to make your solution more programmatic ? I am wondering any dynamic solution is possibly happen or not. Thanks a lot :)

ADD REPLY • link 8.7 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

Dear Goutham:

Thanks for your quick respond here. Yes, name of bed files is rather long, I designed function for accepting list of bed files (for example, myFunc accept at most tow or three peak files with same pattern as an input, that I showed in my example). Could you have more general solution? what if I choose other bed files with same pattern in my external data ? Is that possible to make your solution more programmatic ? Plus, here is my devel package hosted on github. Thanks a lot :)

ADD REPLY • link 8.7 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

I don't think there can be a general solution. It depends on what defines the pattern that captures replicates, in your example stripping "Rep.*" is enough, but can you be sure that it is sufficient for other files? If the answer is no, than you could check that the number of files capture by each pattern is, say, between 2 and 4, but it may not be very robust.

I guess very much in theory one could have some probabilistic, machine learning strategy to find patterns discriminating groups but I'm pretty sure you don't want to go there for the job you describe! For example one could compute the edit distance between pairs of file names and cluster names on the base of that similarity distance.

ADD REPLY • link 8.7 years ago by dariober 15k

0

Entering edit mode

Dear dariober:

Thanks for your concern on my question. If you look at external data in my devel package , I must choose any three or two bed files for my package use. I think Goutham's solution so close to generate more robust solution. Maybe you could give me more interesting approach to solve this problem. Thank you :)

Best regards :

Jurat

ADD REPLY • link 8.7 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

Your package seems very interesting by the way, keep it up! Combining weak peaks is something I have been wondering about myself...

ADD REPLY • link 8.7 years ago by dariober 15k

0

Entering edit mode

Dear dariober :

Thanks for you are being supportive, appreciated. Yes, it could be interesting packages will be hosted in Bioconductor. Any possible contribution from Biostar community is very appreciated. Thanks again.

Best regards :

Jurat

ADD REPLY • link 8.7 years ago by Jurat Shahidin ▴ 100