Dear all,
I've been trying to classify promoters into HCP, LCP and ICP and didn't had success, so I ask for help or suggestions.
The workflow I used is:
- Obtain all UCSC refseq promoters
- Take TSS of all genes and make windows around TSS of -2Kb and +500bp
- Divide each window in bins of 500bp with a 5bp offset (sliding of 5bp) --> bedtools makewindows -i src -w 500 -s 5
At each window calculate: # of CGs, # of Cs, # of Gs, and % of GC content
bedtools nuc -pattern CG -fi genome.fasta -bed genes.TSS.w500.s5.bed | awk 'NR>1{OFS="\t"; print $1,$2,$3,$4,$14,$8,$9,$13,$6}' > file
The final file looks like:
CHR Start End Gene CGs Cs Gs Length GC_content chr1 12268 12373 DDX11L1 6 40 35 105 0.714286 chr1 12273 12373 DDX11L1 5 38 32 100 0.700000 chr1 12278 12373 DDX11L1 4 34 32 95 0.694737 chr1 12283 12373 DDX11L1 4 31 31 90 0.688889 chr1 12288 12373 DDX11L1 4 30 29 85 0.694118
Use R to find genes which are HCP, LCP or ICP using the following script:
t<-read.table(file,sep="\t") ratio_cpg <- function(x){ x1=x[,1]*x[,4] x2=x[,2]*x[,3] x3=x1/x2 return(x3) } t2=cbind(t[,1:ncol(t)],ratio_cpg(t[,5:9])) colnames(t2)<-c("CHR","Start","End","Gene","CGs","Cs","Gs","Length","GC_content","CpG_ratio") names=levels(t2[,4]) hcp_genes <- NULL; hcp_values<-NULL; lcp_genes <- NULL; lcp_values<-NULL; icp_genes <- NULL; icp_values<-NULL; for ( i in 1:length(names) ) { ## Take table slices which gene names are equal to NAMES vector created gc_con <- as.data.frame(t[ t[,4] == names[i] ,][,9]) cpg_ratio <- as.data.frame(t[ t[,4] == names[i] ,][,10]) values <- cbind(gc_con,cpg_ratio) ## Find HCP promoters if ( length(which(values[,1] > 0.55)) > 0 && length(which(values[,2] > 0.75 )) > 0 ) { if ( is.null(hcp_genes) ) { hcp_genes <- names[i] hcp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]>0.75,]),na.rm=TRUE) } else { if ( !isTRUE( names[i] %in% hcp_genes ) ) { hcp_genes <- c(hcp_genes,names[i]) hcp_values <- c( hcp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]>0.75,]),na.rm=TRUE) ) } } next ## Find LCP promoters } else if ( length(which(values[,2] < 0.48)) > 0 && length(which(values[,2] > 0.48)) == 0) { if ( is.null(lcp_genes) ) { lcp_genes <- names[i] lcp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]<0.48,]),na.rm=TRUE) } else { if ( !isTRUE( names[i] %in% lcp_genes ) ) { lcp_genes <- c(lcp_genes,names[i]) lcp_values <- c( lcp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]<0.48,]),na.rm=TRUE) ) } } next ## Find ICP promoters } else { if ( is.null(icp_genes) ) { icp_genes <- names[i] icp_values <- mean(as.vector(cpg_ratio[cpg_ratio[,1]>=0.48 & cpg_ratio[,1]<=0.75,]),na.rm=TRUE) } else { if ( !isTRUE( names[i] %in% icp_genes ) ) { icp_genes <- c(icp_genes,names[i]) icp_values <- c( icp_values, mean(as.vector(cpg_ratio[cpg_ratio[,1]>=0.48 & cpg_ratio[,1]<=0.75,]),na.rm=TRUE) ) } } next } }
Unfortunately with this method I could not obtain the results as the paper of Weber (2007, http://www.nature.com/ng/journal/v39/n4/full/ng1990.html#a1), and I don't know why.
This is my result after this steps:
Could anybody give me some advice on this issue?
Thanks for your help!
Thanks for your point. So you are suggesting that a good starting point would be to just calculate the CpG ratio over promoters and just plot to see if the results are close enough? In that case, do you think that the strategy of defining bins over promoter is not suitable for just plotting the left side of the figure? I mean that maybe I just need to consider that in order to plot that histogram, consider calculating CpG ratio on a window 1000bp for example and use only one value/promoter? (as opposed to calculate CpG ratio over windonws of 500 bins)
yes, start with that. People tend to come up with some tweaks to the way they define concepts and you have to make sure that you are sufficiently close to that to begin with. I have skimmed over the methods but I did not fully understand how the histograms are separated. But as I pointed out note these do actually overlap whereas in your plot you have sharp cutoffs.