Entering edit mode
5.3 years ago
brunobsouzaa
▴
830
Hi guys,
I'm using R to modify cnvkit output for my analysis. The thing is that I have multiple region files (*.cnr) and it's time-consuming (and not necessary) to run the script in each file. So, does anyone know what can I do? Basically what I need is to loop through "file", "file_1", "file_2", "file_n". So, the output always needs to be the exact filename!
The script:
# Clear workspace
rm(list=(ls()))
# Load File and Reference
file <- read.csv("file.cnr", header=T, sep="\t")
ref <- read.csv("ref.cnn", header=T, sep="\t")
# Merge both files by "start" position
merged <- merge(file, ref, by="start", suffixes=c(".file", ".ref"))
# Round "log2" column
merged$log2file <- round(merged$log2.file, digits=1)
# re-calculate "cn" based on log2 correction
merged$cn <- round(2*(2^(merged$log2.file)))
# Subset file with all "cn" values that are not 2
alt.cn <- subset(merged, merged$cn !=2)
# Create new data with columns of interest
alt.cns <- as.data.framealt.cn[, c(1:8,13)])
# Re-order columns for better view
alt.cns <- alt.cns[c(2,1,3,4,6,5,8,7,9)]
# Calculate ratio between coverages
alt.cns$depth.ratio <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)
alt.cns$depth.ratio.1 <- round(alt.cns$depth.file / alt.cns$depth.ref, digits=2)
## Function to call for DUP or DEL.
alt.cns$SV_type <- ifelse(alt.cns$cn < 2, "DEL", "DUP")
# Convert "alt.cns" to .bed file
full <- alt.cns[c(1,2,3,12,5,4,6,7,8,9,10,11)]
names(full)[1] <- "#Chrom"
names(full)[2] <- "Start"
names(full)[3] <- "End"
names(full)[4] <- "SV_type"
names(full)[6] <- "gene"
names(full)[7] <- "log2"
# Save "alt.cns" as .bed file
write.table(full, file="/path/to/output/file.bed", row.names=F, col.names=T, sep="\t")
# Filter "alt.cns" file
filtered <- subset(alt.cns, alt.cns$depth.ratio > 0.6 & alt.cns$depth.ratio.1 > 1.4 & alt.cns$weight > 0.3)
filtered <- filtered[c(1,2,3,12,5,4,6,7,8,9,10)]
names(filtered)[1] <- "#Chrom"
names(filtered)[2] <- "Start"
names(filtered)[3] <- "End"
names(filtered)[4] <- "SV_type"
names(filtered)[6] <- "gene"
names(filtered)[7] <- "log2"
#Save file
write.table(filtered, file="/path/to/output/file.filtered.bed", row.names=F, col.names=T, sep="\t")
Thanks!
What have you tried? Please consider doing some for loop tutorials first and see how far you can come (there are many available, search google). If you get stuck somewhere show us what you have tried and we will try to find a solution for your obstacle. I think that just asking us to rewrite your current script into a loop is not the way to learn bioinformatics.
You're absolutely right... I've tried some for loops, the last one:
files <- Sys.glob("path/to/*.cnr") For ( i in files ) { my_script_here
} But didn't worked! I've also tried to split the variables inside my "files" but it also didn't work. Actually, this is only a small part of my pipeline, all other parts (in bash) are inside a for loop and works like a charm! To be more specific, I've just posted the script so one can see what is going on... I think it's easier to see and maybe the script can help someone else!
Does something like this work?
Tried this approach, it breaks on the "merge" step. For some reason, it doesn't recognize my $start column from files! "Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column"
see what colnames you have:
NB. Maybe
file
is not the smartest name since it is a function to in R.colnames are ok for "sample" and "s_ref" variables
Note: if I don't try to apply "for", the script works well for each sample!
All right, this is the first time you use
sample
. In your initial script nowheresample
. It is impossible to help you this way, I can't just guess what's going, if you know what I mean...Sorry for that, just changed "file" by "sample" since it's an R function... But thanks for your help, I'm trying to figure out what is going on with the loop... If I find the solution, I'll post it here!
N.B.
sample
is a function too. But that is probably not the problem.The error you get now, is that the columns are not unique, so are there not unique values in one of these columns?
Those columns contains start positions from my CNVs, so yes, they are unique values. I'm just matching the two files by those values... For single samples, this works well, the problem is inside the loop!
Okay, if you say so. Good luck!
It might be useful just to test it with
unique
function. Coordinates on different chromosomes can be the same...