BEDOPS scales to arbitrarily-sized inputs. You could use the Kent utility fetchChromSizes to make a BED file of reference genome chromosome sizes (or your own approach, if you're bringing your own genome), use vcf2bed to convert the VCF file to BED, use bedops --chop to generate windows, and use bedmap to count SNPs over sliding windows.
Generating chromosome bounds, e.g., for hg38:
$ fetchChromSizes hg38 | awk -v OFS='\t' '{ print $1, "0", $2 }' | sort-bed - > hg38.bed
(BEDOPS tools expect tab-delimited, sort-bed-sorted input, hence the OFS setting and the pipe through sort-bed.)
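If it worked, the first row should look something like this (the exact set of contigs depends on what the assembly download provides):
$ head -1 hg38.bed
chr1	0	248956422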
Or if you have a de novo genome, you would use your technique of choice to generate a similar bounds file.
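For instance, as a rough sketch assuming samtools is installed and your assembly is in a (hypothetical) genome.fa, the sequence lengths in the FASTA index can be turned into a bounds file the same way:
$ samtools faidx genome.fa
$ awk -v OFS='\t' '{ print $1, "0", $2 }' genome.fa.fai | sort-bed - > genome.bed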
Making windows, e.g., 10 kb windows (the "sliding" part comes from --range, below):
$ bedops --chop 10000 hg38.bed > hg38.windows.bed
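Each input extent is cut into consecutive, non-overlapping 10 kb intervals (the last window on each chromosome may be shorter), so the first rows of the output look like:
$ head -3 hg38.windows.bed
chr1	0	10000
chr1	10000	20000
chr1	20000	30000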
Counting SNPs over windows that are padded out 5k on both sides:
$ bedmap --echo --count --range 5000 hg38.windows.bed <(vcf2bed < snps.vcf) > hg38.windows.counts.bed
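By default, bedmap joins the echoed window and its count with a pipe character, so the output rows look like the following (the counts shown here are made-up placeholders):
$ head -3 hg38.windows.counts.bed
chr1	0	10000|0
chr1	10000	20000|12
chr1	20000	30000|7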
Using --range this way gives sliding-window counts: each 10 kb window is padded by 5 kb on each side, so every counting window extends 5 kb into its neighbors and consecutive windows overlap.
Once you see how the above steps are done, you can concatenate them into a one-liner:
$ fetchChromSizes hg38 | awk -v OFS='\t' '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 10000 - | bedmap --echo --count --range 5000 - <(vcf2bed < snps.vcf) > hg38.windows.counts.bed
Using Unix streams in this way eliminates the time cost of writing intermediate files, so the approach scales to whole-genome inputs.
Once you have a file like hg38.windows.counts.bed (for whatever reference genome, window size, and sliding parameters you choose), you could convert it to a bedGraph file and then to bigWig to plot density in a UCSC genome browser instance, or plot the signal directly with Circos, R, or other tools to view density per chromosome.
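For the bedGraph/bigWig route, one possible sketch (assuming the UCSC Kent utility bedGraphToBigWig is installed; the file names are placeholders) is to replace the pipe delimiter with a tab and then convert:
$ fetchChromSizes hg38 > hg38.chrom.sizes
$ tr '|' '\t' < hg38.windows.counts.bed > hg38.windows.counts.bedgraph
$ bedGraphToBigWig hg38.windows.counts.bedgraph hg38.chrom.sizes hg38.windows.counts.bw
The resulting .bw file can then be loaded as a bigWig custom track or track hub in the UCSC browser.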
How large is the VCF file for 1 million SNPs?
Hi, my VCF file size is 911M.