You can use BEDOPS to solve this problem.
First, reformat your variants into a sorted BED file, which uses UCSC chromosome names:
$ tail -n+2 variants.txt | awk -v FS="\t" -v OFS="\t" '{ print "chr"$1, ($2 - 1), $2, $3 }' | sort-bed - > variants.bed
Second, get gene annotations, e.g. from Gencode, assuming human:
$ GENCODE_URL=ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gff3.gz
$ wget -qO- "${GENCODE_URL}" | gunzip -c | awk '($3=="gene")' | gff2bed - | awk -v FS="\t" -v OFS="\t" '{ n=split($10,a,";"); p=a[4]; n=split(p,a,"="); $4=a[2]; print $0; }' | cut -f1-6 > genes.bed
Now you have variants and genes in sorted BED format, both using UCSC chromosome names.
If you want all genes within 100kb, you can use bedmap --range
to associate the variants with genes overlapping within that window:
$ bedmap --echo --echo-map-id-uniq --range 100000 --delim '\t' variants.bed genes.bed > answer.bed
The file answer.bed
will contain the variant position and the unique IDs of all genes overlapping a 200kb window, centered on the variant position — i.e., 100kb upstream and downstream.
The --echo-map-id-uniq
option can be replaced with --echo-map
, if you need the intervals or other metadata for overlapping genes; see the documentation for more detail.
This file can be read into R via read.table
or similar, if needed.
100kb is a pretty wide window. Why 100kb?
I am trying to find nearby genes.
100kb is really not nearby. That's why I ask, why that number?
Sequence Ontology define an upstream/downstream variant of a gene as within 5kb
I am performing rare variant analysis and want to find any loci associated within 100KB. This paper has additional information: https://alz-journals.onlinelibrary.wiley.com/doi/10.1002/alz.12319
I just think that with 3 million variants and 100kb around each, you're going to have more data than is ever going to be useful. You'll probably get every single gene several times over.
Actually, the variants I am interested in are about 100 only.
bedtools closest would be a good start