Hello,
I am analyzing the distribution of CNV-affected genes in a genome. I divided the genome into non-overlapping bins of equal length and counted the number of CNV-affected genes within each bin. The method I used was from this post: C: Finding gene density from reference genome using R. As you can see there, the GRanges table containing the locations of all the bins gets an extra column showing the number of genes (in my case, CNV-affected genes) in each bin. I then found the 10 bins that contain the most CNV-affected genes, extracted their coordinates, and tried to identify the individual genes (from a list of CNV-affected genes that I generated) located in those bins for further analysis. The code I used to extract these genes was a Unix command from this question I posted: http://unix.stackexchange.com/questions/303809/select-multi-column-rows-based-on-ranges-specified-in-a-separate-file
However, while the first method in R identified 247 CNV-affected genes in these 10 bins, my Unix command found only 175 CNV-affected genes in the same 10 bins. My co-worker wrote a Perl script for the same purpose, and it also found 175 genes.
While using the R method, I found that in some bins the number of genes reported by the countOverlaps function was off by 1 when verified with the subset function (both are mentioned in the first post). I assume this was because of genes crossing two bins. However, I don't understand why the results from R and the Unix command differ so much. Can anyone help explain this? Thank you very much!
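To make the "genes crossing two bins" idea concrete, here is a minimal Python sketch with made-up coordinates (none of these genes or bins come from my data). It assumes that one method counts a gene in every bin it overlaps at all, while the other only keeps genes lying entirely inside a bin. I have not confirmed that this is what the R and Unix approaches actually do; it only shows how the two definitions can diverge.

```python
# Hypothetical toy data: 1-based, closed intervals (start, end) on one chromosome.
bins = [(1, 100), (101, 200), (201, 300)]
genes = {
    "geneA": (10, 40),    # entirely inside bin 1
    "geneB": (90, 120),   # crosses the bin 1 / bin 2 boundary
    "geneC": (150, 160),  # entirely inside bin 2
    "geneD": (195, 260),  # crosses the bin 2 / bin 3 boundary
}

def overlaps(gene, bin_):
    """True if the two closed intervals share at least one base."""
    return gene[0] <= bin_[1] and gene[1] >= bin_[0]

def within(gene, bin_):
    """True only if the gene lies entirely inside the bin."""
    return gene[0] >= bin_[0] and gene[1] <= bin_[1]

# Per-bin counts under each definition.
any_overlap_counts = [sum(overlaps(g, b) for g in genes.values()) for b in bins]
within_counts = [sum(within(g, b) for g in genes.values()) for b in bins]

# Distinct genes touching at least one bin (each gene counted once).
unique_genes = {name for name, g in genes.items()
                if any(overlaps(g, b) for b in bins)}

print("any overlap:", any_overlap_counts, "sum =", sum(any_overlap_counts))
print("fully within:", within_counts, "sum =", sum(within_counts))
print("unique genes:", len(unique_genes))
```

With these toy values, summing the any-overlap counts gives 6 for only 4 distinct genes, because each boundary-crossing gene is counted in two bins, while the fully-within rule finds only 2. So both the overlap definition and whether per-bin counts are summed or de-duplicated can move the total.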
Thank you for your answer! Do you mind telling me how I should adjust the coordinates? Also, could this off-by-one error make such a huge difference? The sums of genes from the 10 bins calculated using R and the Unix/Perl scripts were off by 50...
It will depend on how you count things, but generally, you might start with the following overview and decide whether this is an issue: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems. You could also test overlaps on a handful of ranges via R/GRanges and Unix/Perl, to see whether you get the same or different answers with whatever procedures you're using.
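As a rough illustration of the coordinate issue, the Python sketch below compares a 0-based, half-open interval (the BED convention) with a 1-based, closed interval (the convention GRanges uses). The helper names and the example ranges are hypothetical, and whether your files actually mix the two conventions is an assumption you would need to check.

```python
def bed_to_one_based(start, end):
    """Convert 0-based, half-open [start, end) to 1-based, closed [start+1, end]."""
    return start + 1, end

def overlaps_half_open(a, b):
    """Overlap test for 0-based, half-open intervals."""
    return a[0] < b[1] and b[0] < a[1]

def overlaps_closed(a, b):
    """Overlap test for 1-based, closed intervals."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical ranges in BED-style coordinates: they are adjacent, not overlapping.
bin_bed = (0, 100)     # bases 1..100 in 1-based terms
gene_bed = (100, 150)  # bases 101..150 in 1-based terms

# Correct: use the test that matches the coordinate system.
correct = overlaps_half_open(gene_bed, bin_bed)

# Mistake: feed BED numbers straight into a 1-based, closed test.
mixed = overlaps_closed(gene_bed, bin_bed)

# Fix: convert first, then apply the 1-based, closed test.
converted = overlaps_closed(bed_to_one_based(*gene_bed),
                            bed_to_one_based(*bin_bed))

print(correct, mixed, converted)
```

Here the unconverted comparison reports a spurious overlap, and converting the coordinates first removes it. An error like this only affects genes that start or end exactly on a bin edge, so on its own it would probably explain a difference of a few genes per bin rather than a large one.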