I am working on 4C data. I have a .txt file with the columns chromosome, start, end, nReads, RPMs, p.value and q.value. I am only interested in significant interactions on chr15, and I later want to filter out the interactions that are farther than 100kb from, and nearer than 3kb to, the promoter.
library(r3Cseq)
library(BSgenome.Hsapiens.UCSC.hg19.masked)
library(GenomicRanges)
library(Homo.sapiens)
kura.int <- read.table("KURA_DpnII.interaction.txt", header = T)
kura_data <- kura.int[kura.int$chromosome == "chr15" & kura.int$q.value > 0.1, ]
kura.int.gr <- makeGRangesFromDataFrame(kura_data, keep.extra.columns = T)
id <- "91433"
rccdGene <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene,
                  filter = list(gene_id = id))
rccdPromoter <- start(rccdGene)
kura_end <- ((rccdPromoter+kura_data$end)/2)
kura <- cbind(rccdPromoter, kura_end)
kura_2 <- cbind(kura, kura_data$chromosome)
colnames(kura_2) <- c("start", "end", "chr")
kura_3 <- kura_2[distance(kura_2$start, kura_2$end)<=100000]
The "kura_2" matrix has 3 columns, "chr", "start" and "end", where the new start is the promoter of the gene and the ends differ per interaction. I wrote the block of code above, but at the filtering step, where I use the function "distance", I get this error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘distance’ for signature ‘"character", "character"’
Now, how do I filter out of kura_2 the interactions that are more than 100kb and less than 3kb between the start and end? Is there a better way to filter out the interactions? Thank you in advance.
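A likely cause of the error, shown as a minimal sketch with made-up coordinates (promoter, ends, m and df are hypothetical names, not objects from the post): cbind() builds a matrix, and a matrix can hold only one type, so binding the chromosome column converts start and end to character. distance() from GenomicRanges is defined for range objects such as GRanges, not for plain character vectors, which would explain the "character", "character" signature in the message.

```r
# Made-up values standing in for rccdPromoter and the interaction ends
promoter <- 86300000
ends     <- c(86301500.5, 86350000, 86450000)

# cbind() returns a matrix; adding a character column coerces
# every value in the matrix to character
m <- cbind(start = promoter, end = ends, chr = "chr15")
class(m[, "start"])   # the coordinates are no longer numeric

# A data frame keeps one type per column, so arithmetic still works
df <- data.frame(chr = "chr15", start = promoter, end = ends)
df$end - df$start
```

Storing the coordinates in a data frame (or a GRanges object) instead of a matrix keeps them numeric, so they can be compared and subtracted directly.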
Can you provide an example of the data in the post, and what you want the end product to look like? dput(head(kura_2, n=10)) will output the first 10 rows of the matrix in a form you can share in your post.
I have updated the question, attaching the kura_2 data. The output should look the same, with chr, start and end, but without the interactions that are greater than 100kb and less than 3kb.
From your post you say that start is the start of the promoter. What does end represent then, the other points in the genome that were interacting with the promoter? Also, why are there decimal places in the end column? Genomic coordinates are usually represented as integers.
Yes, the new start is the promoter of the gene and the new end is ((start+end)/2). That's the reason I have float values, because this way it is easy to plot interactions from my promoter (bait).
Alright, and one more question. Right now you are asking to keep interactions that are more than 100kb AND less than 3kb. Do you mean more than 100kb OR less than 3kb, or instead do you mean between 100kb and 3kb?
I want to remove the interactions that are further away than 100kb from the promoter, and also those that are closer than 3kb to the promoter. In 4C we can't tell whether these closer interactions are artifacts or real ones; they are most likely self-ligation products, which I am not interested in.
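One way to do this filtering in base R, as a minimal sketch under stated assumptions: kura_df is a hypothetical data frame standing in for kura_2, its coordinates are made up, end is assumed to hold the actual position of the interacting fragment, and the 3kb and 100kb boundaries are treated as inclusive.

```r
# Hypothetical stand-in for kura_2, stored as a data frame so that
# start and end stay numeric (all coordinates are made up)
kura_df <- data.frame(
  chr   = "chr15",
  start = 86300000,   # promoter (bait)
  end   = c(86301000, 86305000, 86390000, 86450000, 86200500)
)

# Absolute distance between the bait and each interaction end,
# so that upstream and downstream interactions are treated alike
d <- abs(kura_df$end - kura_df$start)

# Keep interactions that are at least 3 kb and at most 100 kb from the bait
kura_filtered <- kura_df[d >= 3000 & d <= 100000, ]
kura_filtered
```

If the end column holds the plotting midpoint ((start+end)/2) rather than the fragment coordinate, the distance computed this way is half the distance to the fragment, so the filter would be better applied to the original end values before the midpoint is taken.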