Question

Mince a vcf into n bins of a given range

0

Entering edit mode

2.8 years ago

AndrMod • 0

I have a vcf file containing SNPs from alla across a genome and thanks to the CHROM field I can split it into its various chromosomes; each SNP's POS is already relative to its chromosome position (indeed).

I want to split each chromosome in bins (windows) of 100,000 bp, thus obtaining some bins more enriched of my SNPs while other almost empty or straight empty. eg. :

object1: CHROM1
CHROM 1, bin 1 [1 - 100,000 bp] = 0 SNPs
CHROM 1, bin 2 [100,001 - 200,000 bp] = 12 SNPs (names of the SNPs)
...

object2: CHROM 2
CHROM 2, bin 1 [1 - 100,00] = 4 SNPs (names of the SNPs)

I managed to concot a manual approach on R. It has the good feature that each binned chromosome is a list of bins, each containing the SNP name and POS; however, it's very slow: it requires individual manipulation both of the chromosomes and the ranges.

I wonder if there is a faster way,be it by R or other tools, to bin chromosomes by predefinite ranges and obtainig bin easily attributable to their chromosomes.

So far I investigated various tools, but mostly they proceed the other way - putting n SNPs per bin, so that very unlinked SNPs may be binned together.

I will appreciate any light cast on this topic. Thanks to everybody.

vcf binning • 2.0k views

ADD COMMENT • link updated 2.8 years ago by Alex Reynolds 36k • written 2.8 years ago by AndrMod • 0

0

Entering edit mode

If you have range in bed format, I think bedtools intersect with count will help you. It takes vcf and bed format for input.

ADD REPLY • link 2.8 years ago by cpad0112 21k

0

Entering edit mode

I haven't the ranges in any format, but I suppose I can automate-write one of such files via R. Thanks!

ADD REPLY • link 2.8 years ago by AndrMod • 0

score 0 · Answer 1 · 2022-01-20

I wrote http://lindenb.github.io/jvarkit/VcfToIntervals.html

so a solution could be:

i=0; java -jar dist/vcf2intervals.jar -D 1m --bed indexed.vcf.gz | awk  '{printf("%s:%d-%s\n",$1,int($2)+1,$3);}' | while read R; do bcftools view -o "out.${i}.vcf.gz" -O z --regions "${R}" indexed.vcf.gz; i=$((i+1)); done

score 0 · Answer 2 · 2022-01-20

0

Entering edit mode

2.8 years ago

Alex Reynolds 36k

You can make bins with BEDOPS bedops --chop and Kent utilities fetchChromSizes:

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 100000 - > hg38.100kb.bed

Replace hg38 with your reference genome name, if different.

Once you have bins, you can map them to your VCF file via vcf2bed:

$ bedmap --echo --count --delim '\t' hg38.100kb.bed <(vcf2bed --sort-tmpdir=${PWD} --max-mem=2G < variants.vcf) > answer.bg

The file answer.bg is a bedGraph file: four columns representing genomic position and counts of variants over each bin. This can be easily imported into R.

ADD COMMENT • link 2.8 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks, this seems to work for the overall binning! I just wonder if there is a way to recover the names of the SNPs stored in each bin.

ADD REPLY • link 2.8 years ago by AndrMod • 0

0

Entering edit mode

In the bedmap command, add --echo-map-id after --echo or --count, depending on where you want the names in the output. See bedmap --help for a full listing of options.

ADD REPLY • link 2.8 years ago by Alex Reynolds 36k