Via BEDOPS vcf2bed
, bedops
, and bedmap
, one can build a BED file of insertions and deletions, and map their sizes to some reference genome, piping a list of sizes to a histogram/statistical summary tool, such as the hist
tool in bashplotlib
.
First, to build a list of INDELs:
$ vcf2bed --insertions < variants.vcf > insertions.bed
$ vcf2bed --deletions < variants.vcf > deletions.bed
$ bedops --everything insertions.bed deletions.bed > union.bed
Or via a one-liner:
$ bedops --everything <(vcf2bed --insertions < variants.vcf) <(vcf2bed --deletions < variants.vcf) > union.bed
To get chromosome bounds for hg38
(adjust for your reference genome), you can use Kent Utilities fetchChromSizes
:
$ fetchChromSizes hg38 | awk -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - > hg38.bed
To get a quick size distribution summary, one can pipe the output of bedmap --echo-map-size
to the hist
tool in bashplotlib
.
Here's an example of output from a random subset of 1000Genome SNPs off of chromosome X:
$ bedmap --echo-map-size --multidelim '\n' --skip-unmapped hg38.bed union.bed | hist
29| o
28| o
26| o
25| o
23| o
22| o
20| o
19| o
17| o
16| o
14| o
13| o
11| oo
10| oo
8| oo
7| oo
5| oo
4| oo
2| oo
1| ooo o
-----------
----------------------------------
| Summary |
----------------------------------
| observations: 44 |
| min value: 1.000000 |
| mean : 84.522727 |
| max value: 3105.000000 |
----------------------------------
Such a list of sizes could instead be piped into a Python or R script (for example) to generate a summary in any other desired format.
what is this command's tools? vcftools?