Question

Problems dividing a genome into non-overlapping bins and counting genes

0

8.3 years ago

milk841103 ▴ 10

Hello,

I am analyzing the distribution of CNV-affected genes in a genome. I divided the genome into non-overlapping bins of equal length and counted the number of CNV-affected genes within each bin. The method I used is from this post: C: Finding gene density from reference genome using R. As you can see there, the GRanges table containing the locations of all the bins gets an extra column showing the number of genes (in my case, CNV-affected genes) in each bin. I then found the 10 bins that contain the most CNV-affected genes, extracted their coordinates, and tried to identify the individual genes (from a list of CNV-affected genes that I generated) located in those bins for further analysis. The code I used to extract these genes is a Unix command from this question I posted: http://unix.stackexchange.com/questions/303809/select-multi-column-rows-based-on-ranges-specified-in-a-separate-file

However, while the first method in R identifies 247 CNV-affected genes in these 10 bins, my Unix command finds only 175 CNV-affected genes in the same 10 bins. My co-worker wrote a Perl script for the same purpose, and it also found 175 genes.

While using the R method, I found that in some bins the number of genes reported by the countOverlaps function was off by 1 when I verified it with the subset function (both are mentioned in the first post). I assume this is because some genes cross two bins. However, I don't understand why the results from R and the Unix command differ so much. Can anyone help explain this? Thank you very much!

genome R gene unix • 2.3k views

ADD COMMENT • link updated 8.3 years ago by Ram 44k • written 8.3 years ago by milk841103 ▴ 10

score 0 · Answer 1 · 2016-08-18

0

8.3 years ago

Alex Reynolds 36k

The BED or tab-delimited text files you work with via Unix likely use a 0-based, half-open coordinate system. GRanges objects use a 1-based, closed coordinate system. This could create off-by-one errors if you don't adjust coordinates before doing an apples-to-apples set operation.
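As a rough illustration of the adjustment, here is a minimal Python sketch. It assumes one file really is 0-based, half-open (BED-style) and the other is 1-based, closed; the helper names and the toy coordinates are made up for the example and are not taken from your data.

```python
def bed_to_one_based(start, end):
    """0-based, half-open [start, end) -> 1-based, closed [start + 1, end]."""
    return start + 1, end


def one_based_to_bed(start, end):
    """1-based, closed [start, end] -> 0-based, half-open [start - 1, end)."""
    return start - 1, end


def overlaps_closed(a, b):
    """Overlap test for two 1-based, closed (start, end) intervals."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical bin in 1-based, closed coordinates.
bin_1based = (1, 1000)

# Hypothetical gene in BED-style coordinates: it covers bases 1001..1200.
gene_bed = (1000, 1200)

# Mixing the two conventions without converting reports a spurious overlap.
naive_hit = overlaps_closed(gene_bed, bin_1based)

# Converting first shows the gene starts just past the end of the bin.
converted_hit = overlaps_closed(bed_to_one_based(*gene_bed), bin_1based)

print(naive_hit, converted_hit)
```

Only the start coordinate shifts by one; the end stays the same, so a mismatch can only matter for features that touch a bin boundary exactly.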

ADD COMMENT • link 8.3 years ago by Alex Reynolds 36k

0

Thank you for your answer! Would you mind telling me how I should adjust the coordinates? Also, could this off-by-one error make such a huge difference? The sums of genes across the 10 bins calculated using R and the Unix/Perl scripts were off by 50...

ADD REPLY • link 8.3 years ago by milk841103 ▴ 10

0

It will depend on how you count things, but you might start with the following overview and decide whether this is an issue: Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems. You could also test overlaps on a handful of ranges via R/GRanges and Unix/Perl to see whether you get the same or different answers with the procedures you're using.
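To show what "how you count things" can mean, here is a small Python sketch on hand-made toy ranges. The bins, gene names, and coordinates are hypothetical, and the "start inside the bin" rule is only an assumed example of what a text-processing script might do, not a description of your Unix command or Perl script.

```python
def overlaps(a, b):
    """Overlap test for two 1-based, closed (start, end) intervals."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two adjacent, non-overlapping toy bins (1-based, closed).
bins = [(1, 1000), (1001, 2000)]

# Toy genes; geneB crosses the boundary between the two bins.
genes = {
    "geneA": (100, 400),
    "geneB": (900, 1100),
    "geneC": (1500, 1800),
}

# Rule 1: count every gene that overlaps the bin at all.
any_overlap_counts = [
    sum(overlaps(g, b) for g in genes.values()) for b in bins
]

# Rule 2 (assumed alternative): count a gene only in the bin holding its start.
start_in_bin_counts = [
    sum(b[0] <= g[0] <= b[1] for g in genes.values()) for b in bins
]

# Number of distinct genes touching at least one of the bins.
distinct_genes = {
    name for name, g in genes.items() if any(overlaps(g, b) for b in bins)
}

print(sum(any_overlap_counts), sum(start_in_bin_counts), len(distinct_genes))
```

With any-overlap counting, a gene that spans a boundary is counted once in each bin it touches, so the sum of per-bin counts can exceed the number of distinct genes. Comparing the totals under each rule on a few of your own ranges should show whether that, the coordinate convention, or something else accounts for the gap.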

ADD REPLY • link 8.3 years ago by Alex Reynolds 36k