I have a VCF file containing SNPs from all across a genome. Thanks to the CHROM field I can split it by chromosome, and each SNP's POS is already relative to its own chromosome.
I want to split each chromosome into bins (windows) of 100,000 bp, so that some bins end up enriched in my SNPs while others are almost or completely empty, e.g.:
object1: CHROM 1
CHROM 1, bin 1 [1 - 100,000 bp] = 0 SNPs
CHROM 1, bin 2 [100,001 - 200,000 bp] = 12 SNPs (names of the SNPs)
...
object2: CHROM 2
CHROM 2, bin 1 [1 - 100,000 bp] = 4 SNPs (names of the SNPs)
I managed to put together a manual approach in R. Its good feature is that each binned chromosome is a list of bins, each containing the SNP names and POS values; however, it is very slow, because it requires handling every chromosome and every range individually.
I wonder if there is a faster way, in R or with other tools, to bin chromosomes by predefined ranges and obtain bins that are easily attributable to their chromosomes.
So far I have investigated various tools, but most of them work the other way round: they put n SNPs in each bin, so that SNPs that are far apart and unlinked may be binned together.
I would appreciate any light shed on this topic. Thanks, everybody.
If you have the ranges in BED format, I think bedtools intersect with its count option (-c) will help you. It accepts both VCF and BED files as input.
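If the ranges do not exist yet, a windows file could be generated from the chromosome lengths. Below is a minimal sketch in Python, not a definitive implementation: the function names, the chromosome names and the lengths are all made up for illustration, and real lengths would have to come from somewhere like the ##contig lines of the VCF header or a .fai index. BED coordinates are 0-based and half-open, so the window 0-100000 covers the 1-based positions 1 to 100,000.

```python
def make_windows(chrom_lengths, width=100_000):
    """Yield (chrom, start, end) BED records tiling each chromosome.

    BED intervals are 0-based and half-open, so (0, 100000) covers
    the 1-based positions 1..100000.
    """
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, width):
            # The last window is clipped to the chromosome length.
            yield chrom, start, min(start + width, length)


def write_bed(path, chrom_lengths, width=100_000):
    """Write the windows as a tab-separated, three-column BED file."""
    with open(path, "w") as handle:
        for chrom, start, end in make_windows(chrom_lengths, width):
            handle.write(f"{chrom}\t{start}\t{end}\n")


# Hypothetical chromosome lengths, for illustration only.
chrom_lengths = {"CHROM1": 250_000, "CHROM2": 100_000}

windows = list(make_windows(chrom_lengths))
for chrom, start, end in windows:
    print(chrom, start, end, sep="\t")
```

Calling write_bed("windows.bed", chrom_lengths) would save the same records to a file (the file name is an assumption), which could then be used as the -a input of bedtools intersect -c with the VCF as -b. bedtools also ships a makewindows subcommand that produces this kind of file from a genome file.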
I don't have the ranges in any format, but I suppose I can generate such a file automatically with R. Thanks!
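For fixed-width windows the bin can also be computed directly from POS, without any ranges file: a SNP at 1-based position POS falls into bin number (POS - 1) // 100000, counting from 0. A minimal sketch in Python, assuming a plain uncompressed VCF with the standard column order CHROM, POS, ID; the function name and the example records are hypothetical.

```python
from collections import defaultdict

WIDTH = 100_000  # bin size in bp


def bin_snps(vcf_lines, width=WIDTH):
    """Group SNPs as {chrom: {bin_index: [(snp_id, pos), ...]}}.

    bin_index is 0-based: index 0 holds 1-based positions 1..width,
    index 1 holds width+1..2*width, and so on.
    """
    bins = defaultdict(lambda: defaultdict(list))
    for line in vcf_lines:
        # Skip blank lines and header lines ("##..." and "#CHROM").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, snp_id = line.split("\t")[:3]
        pos = int(pos)
        bins[chrom][(pos - 1) // width].append((snp_id, pos))
    return bins


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "CHROM1\t100000\trs_a\tA\tG",
    "CHROM1\t100001\trs_b\tC\tT",
    "CHROM2\t42\trs_c\tG\tA",
]

bins = bin_snps(example)
for chrom, chrom_bins in bins.items():
    for index in sorted(chrom_bins):
        start, end = index * WIDTH + 1, (index + 1) * WIDTH
        snps = chrom_bins[index]
        print(f"{chrom}, bin {index + 1} [{start} - {end} bp] = {len(snps)} SNPs {snps}")
```

This makes a single pass over the file, so each chromosome and range no longer has to be handled individually. Only bins that contain at least one SNP are stored; an empty bin is simply absent, and listing empty bins as well would additionally require the chromosome lengths. With a real file, the lines could come from open("my.vcf"), where the file name is again an assumption.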