Question

Determine whether SNPs in a list fall under the ranges of a BED file

0

Entering edit mode

6.6 years ago

gaelgarcia ▴ 270

Hello,

Is there a way in GenomicRanges or BEDTools to ask if SNPs in a file are included within the ranges specified in a BED file?

Thanks for your help.

cat snps.txt

    1:15820:G:T
    1:876499:A:G
    1:887560:A:C
    1:887801:A:G
    1:888639:T:C
    1:888659:T:C
    1:889158:G:C
    1:889159:A:C
    1:897325:G:C
    1:897738:C:T
    1:906272:A:C
    1:908823:G:A
    1:909238:G:C
    1:909309:T:C
    1:909419:C:T

cat sequencedRegions.bed

CHROMOSOME  START   STOP    LENGTH
chr1    92320598    92320634    36
chr1    196482330   196482379   49
chr2    32141903    32141946    43
chr6    33442452    33442494    42
chr8    25268844    25268885    41
chr10   5924750     5924801     51
chr10   28533997    28534034    37
chr11   117433440   117433471   31

SNP BED BEDTools R GenomicRanges • 1.9k views

ADD COMMENT • link updated 6.6 years ago by Alex Reynolds 36k • written 6.6 years ago by gaelgarcia ▴ 270

score 3 · Accepted Answer · 2018-05-01

3

Entering edit mode

6.6 years ago

Alex Reynolds 36k

Convert snps.txt to a sorted UCSC-like BED file with awk and BEDOPS sort-bed:

$ awk -vOFS="\t" -vFS=":" '{ print "chr"$1, ($2 - 1), $2, $3"/"$4; }' snps.txt | sort-bed - > snps.bed

Strip the header from the sequencedRegions.bed file and sort it:

$ tail -n+2 sequencedRegions.bed | sort-bed - > sequencedRegions.sorted.bed

Then map SNPs to regions with BEDOPS bedmap:

$ bedmap --echo --count --echo-map --delim '\t' sequencedRegions.sorted.bed snps.bed > answer.bed

The file answer.bed will have each region (--echo), along with the number (--count) of any overlapping SNPs in the next column, and the final column will contain all overlapping SNPs (--echo-map), separated by semi-colons. This last field can be blank if the overlap count is zero; add the --skip-unmapped option, if you want to leave out zero-overlap results.

ADD COMMENT • link 6.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks for your response! I'm trying out your solution, but for awk I get the error: awk: invalid -v option . Is this something to do with my version of awk on Mac perhaps?

ADD REPLY • link 6.6 years ago by gaelgarcia ▴ 270

1

Entering edit mode

On Mac, you might install GNU awk (gawk) via Homebrew: brew install gawk and just replace awk with gawk.

On rechecking, it looks like brew install gawk symlinks to awk, so that you can just use awk directly.

OS X ships with a BSD-specific build of awk, which results in some options being different or unavailable. This is an issue with BSD sed, as well. It can make running Unix scripts on OS X a bit of a pain in the ass, but Homebrew gawk and coreutils are popular packages for dealing with this, by installing GNU kit.

ADD REPLY • link 6.6 years ago by Alex Reynolds 36k

1

Entering edit mode

It is in the same lines of @ Alex code. snps.txt and sequencedRegions.bed are from OP. I added a range to sequencedRegions.bed as sequencedRegions.bed in OP, doesn't intersect SNP records

Input:

$ sed 's/^/chr/g;s/:/\t/g' snps.txt | awk -v OFS="\t" {'print $1,$2-1,$2,$3,$4'}|bedtools sort -i  > snps2.txt
$ sed '1d' sequencedRegions.bed | bedtools sort -i > test2.bed
$ bedtools intersect -a snps2.txt -b test2.bed -wa -wb

output:

    $ bedtools intersect -a snps2.txt -b test2.bed -wa -wb
chr1    888638  888639  T   C   chr1    888539  888739  200
chr1    888658  888659  T   C   chr1    888539  888739  200

ADD REPLY • link 6.6 years ago by cpad0112 21k