If you want to combine the genes if they are very close to each other, such that all transcript in the file will have a minimum distance from each other, you can use bedtools merge. The "-d" argument allows to you control how close they are if you want to merge them.
If you want to find genes that have a good distance from the other genes, you can first use bedtools complement. This allows you to locate regions without genes in the file. Next, you can use some scripts, for exmaple Python or R, to find all large consecutive complements that occur in your file. The genes that you want are all located in the middle of these large consecutive complements.
Here is a simple R script for this purpose:
ComFile <- as.matrix(read.table("YourComplement.bed", sep="\t", header=F))
#The genes' minimum distance from the other genes. (Edit this line)
minDistance <- 3000
#Chr, Start and End of your wanted genes
Chr <- c()
Start <- c()
End <- c()
Previous_Com_Length <- as.numeric(ComFile[1,3]) - as.numeric(ComFile[1,2])
Previous_Chr <- as.character(ComFile[1,1])
for (i in 2:nrow(ComFile)) {
This_Com_Length <- as.numeric(ComFile[i,3]) - as.numeric(ComFile[i,2])
This_Chr <- as.character(ComFile[i,1])
if (This_Chr == Previous_Chr) {
if (Previous_Com_Length >= minDistance && This_Com_Length >= minDistance) {
Chr <- c(Chr, This_Chr)
Start <- c(Start, as.numeric(ComFile[i-1,3]))
End <- c(End, as.numeric(ComFile[i,2]))
}
}
Previous_Com_Length <- This_Com_Length
Previous_Chr <- This_Chr
}
Answer <- data.frame(Chr=Chr, Start=Start, End=End)
write.table(Answer, "YourGenes.bed", quote=FALSE, sep="\t", col.names=FALSE, row.names=FALSE)
Note: the reason I prefer using bedtools complement, instead of directly working on the current gene list, is because you can provide chromosome lengths to complement first. This makes the script a bit less complicated.
I hope this helps,
What do you mean by "well-separated" genes? Genes are where they are at (to the best of our current knowledge). Unless you start cherry-picking ones you like how can they be well-separated?
Thanks for the quick response!!
Very true, but genes that are close together screw up genome-wide ChIP profiles, especially genes that are arranged head-to-head or tail-to-tail. I guess examples are the best way to explain what I'm trying to say!
Check out the gene TRIB1: http://imgur.com/jk8beqp. It has no genes within 8 kb upstream or downstream. You can ascribe the ChIP signal seen in both Pol II and Pol II CTR phospho-Ser2 (pS2) to discrete events within transcription: initiation, elongation, termination. There's a peak of total Pol II at the promoter, and a second small peak in the termination zone downstream of the end of the gene. This second peak corresponds to Ser2 phosphorylation, which only has a single peak. This is a good model gene to understand phosphorylation.
Alternatively, check out Cdk4: http://imgur.com/oWsgbvq. We cannot see a termination zone peak of Pol II because it is obscured by the promoter-proximal peak of TSPAN31. We can't necessarily assign the pS2 peak to the termination zone of Cdk4, because we also have to consider behavior on TSPAN31. This would be a good gene to NOT include in a genome-wide profile.
So, I would like to get a big list of genes that include the well-separated genes that have no neighboring or overlapping genes, so that we can make confident calls of Pol II and TF behavior. I don't think this is cherry-picking, I think it is providing clarity. Would you agree?