Create 10,000bp windows for a SNP file and assign each SNP to its respective window
2
0
Entering edit mode
14 months ago
nitinra ▴ 50

Hello all,

I have a tab file with 700k SNPs (results from BayPass analysis of p-values for Genotype x environment association). I want to create 10k bp window intervals for each of the chromosome/contig and assign each SNP to its corresponding window.

This is the file I have (sample):

Chromosome     SNP        bf.db
Chr_1                 2007         -22.5
Chr_1                 7200        16.05
Chr_1                11500         12.05
Chr_1                 18000        61.05

This is what I need:

Chromosome     SNP        bf.db.     window
Chr_1                 2007         -22.5     0-10000
Chr_1                 7200        16.05     0-10000
Chr_1                11500         12.05  10001-20000
Chr_1                 18000        61.05  10001-20000

I can create windows using bedtools makewindows, but I do not know how to assign each SNP to it's corresponding window.

TIA!

genome snp vcf • 943 views
ADD COMMENT
4
Entering edit mode
14 months ago

I can create windows using bedtools makewindows, but I do not know how to assign each SNP to it's corresponding window.

convert the first file to bed using awk something like /(no tested):

awk '/^Chromosome/ {next;} {printf("%s\t%d\t%s\t%s\n",$1,int($2)-1,$2,$3);}' < file1.tsv

and then combine both files using bedtools intersect

ADD COMMENT
0
Entering edit mode

Pierre Lindenbaum im not ashamed to admit i came to this thread just so i could see how to do it in a one-liner with awk

ADD REPLY
2
Entering edit mode
14 months ago

Using BEDOPS, starting with a VCF-formatted file of SNP variants over hg38, this is a one-liner that writes data to answer.bed:

bedmap --echo --echo-map <(fetchChromSizes hg38 | awk -vFS="\t" -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 10000 -) <(vcf2bed < in.vcf) > answer.bed

In separate lines, here are the separated commands:

fetchChromSizes hg38 | awk -vFS="\t" -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 10000 - > windows.10kb.bed
vcf2bed < variants.vcf > variants.bed
bedmap --echo --echo-map windows.10kb.bed variants.bed > answer.bed

The one-liner will run faster, but the separated-command approach is easier to read and modify.

If you don't have variants in VCF format, get them into VCF format, or use an awk statement as described in another answer. If you use UCSC fetchChromSizes, you'll need to modify the format of the chromosome name to UCSC standard (chr1, chr2, etc.) for the mapping in bedmap to work correctly.

ADD COMMENT

Login before adding your answer.

Traffic: 1635 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6