I have list of bed files that needed to used in my devel package as an external data. Because these bed files has rather complicated pattern, and I intend to get pair of bed files with same pattern for my packages use (i.e., pair of bed file with same pattern could be two or three in my data set, please look at my example). However, I gave my shot to make this happen, but my solution is not efficient and I got error instead. Can anyone point me out how to make this happen easily? How to efficiently manipulate file names with complicated pattern in R ? How to achieve my desired function output ? Any quick solution to solve this issue? Thanks a lot :)
Here is my external data set in my packages, list of bed files in extdata
:
myPkg
- inst
- extdata
- wgEncodeOpenChromChipK562CmycAlnRep1.bed
- wgEncodeOpenChromChipK562CmycAlnRep2.bed
- wgEncodeOpenChromChipK562CmycAlnRep3.bed
- wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep1.bed
- wgEncodeSydhTfbsK562CmycIfna6hStdAlnRep2.bed
- wgEncodeSydhTfbsK562CmycIfna30StdAlnRep1.bed
- wgEncodeSydhTfbsK562CmycIfna30StdAlnRep2.bed
- wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep1.bed
- wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep2.bed
- wgEncodeSydhTfbsK562CmycIggrabAlnRep1.bed
- wgEncodeSydhTfbsK562CmycIggrabAlnRep2.bed
- wgEncodeSydhTfbsK562CmycStdAlnRep1.bed
- wgEncodeSydhTfbsK562CmycStdAlnRep2.bed
- R
Here is part of my solution :
getExtDat <- function() {
dir <- system.file("extdata", package="myPkg")
files <- list.files(dir)
# FIXME : I got error, make it more dynamic
#patt <- gsub(pattern='wgE\\d+_(\\w+_\\w+)_.*', replacement='\\1',files)
res <- paste(dir, files, sep="/")
res <- as.list(res)
return(res)
}
print(getExtDat)
And getExtdat
object passed to this function :
myFunc <- function(getExtDat) {
files <- list.files(getExtDat, full.names = TRUE, "\\.bed$")
readMe <- lapply(files, import.bed)
return(readMe)
}
Edit example :
let getExtDat
find pair of bed files with same pattern (
for example, I need these bed files as an input for myFunc
:
wgEncodeOpenChromChipK562CmycAlnRep1.bed,
wgEncodeOpenChromChipK562CmycAlnRep2.bed, wgEncodeOpenChromChipK562CmycAlnRep3.bed
or I need these bed files as an input for myFunc
:
wgEncodeSydhTfbsK562CmycIggrabAlnRep1.bed
wgEncodeSydhTfbsK562CmycIggrabAlnRep2.bed
or I need these bed files as an input for myFunc
:
wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep1.bed
wgEncodeSydhTfbsK562CmycIfng6hStdAlnRep2.bed
are my objective to find by getExtDat
, instead of loading all bed files ), and pass getExtDat
to myFunc
.
How to make getExtDat
function to find desired paired bed files available for myFunc
as input ? How to correctly identify desired pattern of bed files more dynamically ? Any way to more programmatic solution for finding desired bed files with complicated pattern ? Any idea please ?
I do not know much R, but what I would do is something like,
#Then a for loop for each replicate,
@Goutham Atla: That's what I was also thinking... (I think there shouldn't be spaces around
'|'
in the gsub pattern)Thanks, edited it.
@Goutham Atla : I edited my example as well, Is that possible to make your solution more programmatic ? I am wondering any dynamic solution is possibly happen or not. Thanks a lot :)
Dear Goutham:
Thanks for your quick respond here. Yes, name of bed files is rather long, I designed function for accepting list of bed files (for example,
myFunc
accept at most tow or three peak files with same pattern as an input, that I showed in my example). Could you have more general solution? what if I choose other bed files with same pattern in my external data ? Is that possible to make your solution more programmatic ? Plus, here is my devel package hosted on github. Thanks a lot :)I don't think there can be a general solution. It depends on what defines the pattern that captures replicates, in your example stripping
"Rep.*"
is enough, but can you be sure that it is sufficient for other files? If the answer is no, than you could check that the number of files capture by each pattern is, say, between 2 and 4, but it may not be very robust.I guess very much in theory one could have some probabilistic, machine learning strategy to find patterns discriminating groups but I'm pretty sure you don't want to go there for the job you describe! For example one could compute the edit distance between pairs of file names and cluster names on the base of that similarity distance.
Dear dariober:
Thanks for your concern on my question. If you look at external data in my devel package , I must choose any three or two bed files for my package use. I think Goutham's solution so close to generate more robust solution. Maybe you could give me more interesting approach to solve this problem. Thank you :)
Best regards :
Jurat
Your package seems very interesting by the way, keep it up! Combining weak peaks is something I have been wondering about myself...
Dear dariober :
Thanks for you are being supportive, appreciated. Yes, it could be interesting packages will be hosted in Bioconductor. Any possible contribution from Biostar community is very appreciated. Thanks again.
Best regards :
Jurat