One way is via the BEDOPS and UCSC Kent utilities toolkits.
First, generate chromosome bounds:
$ fetchChromSizes hg38 \
| awk -vOFS="\t" '{ print $1, "0", $2; }' \
| grep -vE '_' \
| sort-bed - \
> hg38.nuc.bed
Second, get SNPs (or use your own — replace common_20180418.starch
with a sort-bed
-sorted copy of a.txt
, if you wanted, for instance):
$ VCF=ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/common_all_20180418.vcf.gz
$ wget -qO- ${VCF}
| gunzip -c -
| convert2bed --input=vcf --sort-tmpdir=${PWD} -
| awk '{ print "chr"$0; }' -
| starch --omit-signature -
> common_20180418.starch
Finally, map SNPs to windows over the bounds. In the following example, we chop up the chromosome into sliding windows that are 100kb wide and staggered every 20kb. These windows are piped to bedmap
, which counts the number of SNPs that overlap each window:
$ bedops --chop 100000 --stagger 20000 hg38.nuc.bed \
| bedmap --echo --count --delim '\t' - common_20180418.starch \
> windows_with_counts.bed
The count of SNPs over the window is in the last column of windows_with_counts.bed
.
Be aware that the edges of chromosomes are often low-complexity regions known for accumulating all kinds of false-positives alignments and false variant calls. Read this paper.