I have a list of genes and a coverage track (generated using BEDOPS-based binning script also called as binReads.sh
), modified it to take bed file as input).
What I want do it to divide the each gene into 100 bins and then calculate the mean overlap from my coverage file, so that later all the respective bins can be added for all genes to give a average protein binding/coverage in that bin.
I couldn't understand that how I can make a specific number of bins for the genes though I can regulate the binSize using binReads.sh
. I made a custom R code, for taking my list of genes and the dividing each of them into 100 bins.
Then, I used this list as a reference, (added a fake 4th column, which is also not very intuitive) to find the mean overlap for my coverage file.
Problems :
1) This method is not very intutitive (because of R code, it takes ~4-6 mins to complete the process for a list of 30.2K genes producing 30.2K*100 lines as a bed file to be used as a reference). It there a way in bedOps suite to generate the bins on the fly.
2) The output of bedmap --mean genes_bins.bed file.cov > means.tsv
is valid numeric for first 42K reference lines out of 3.2M(30.2K*100) and for the rest it produces "NAN". To cross-validate, I took some specific binned genes and it produces a valid output. Is this because of the size of the files (shouldn't be).
Can you suggest some other way to approach this problem.
genes_bins.bed
chr1 3214481 3219051 Xkr4-bin1
chr1 3219051 3223621 Xkr4-bin2
chr1 3223621 3228192 Xkr4-bin3
chr1 3228192 3232762 Xkr4-bin4
chr1 3232762 3237332 Xkr4-bin5
chr1 3237332 3241902 Xkr4-bin6
coverage.bed
chr1 3001500 3001520 1 1
chr1 3001520 3001540 1 1
chr1 3001540 3001560 1 1
chr1 3001560 3001580 1 1
chr1 3001580 3001600 1 1
chr1 3001600 3001620 1 1
chr1 3001620 3001640 1 1
P.S. It was started as comment under Converting BAM to bedGraph for viewing on UCSC and I thought its better to ask as a seperate question. This probably be answered by Alex Reynolds himself. The final aim to make the composite profiles of your protein of interest from the chip-seq datasets.
Thanks
sorry, I have a couple of questions. In the genes_bins.bed example file, the gene Xkr4 is divided into 6 bins. Is this correct? Second question: if you want to split each gene into 100 bins, how do you handle the fact that different genes may have different length?
Giovanni, thats just the
head
of the file, I will mention that. Secondly, thats the whole point of binning, the genes have unequal length, if you want to make a Plotting Read Density/ Distribution/ Composite Profiles [Chip-Seq], then the genes should be be binned and avg binding per bin is calculated in the end to be summed up for per bin for all bins of all genes, to get an average binding over the whole length of the gene. This will effect, that sum genes have more density in a bin and some less. Can you suggest some other strategy!!