How to search for genes 100Kb up and downstream of any genomic site
2
0
Entering edit mode
3.4 years ago
MAPK2 ▴ 50

I have a table like this below for 100 variant sites. I want to find genes that are 100kb up and downstream of select variants. For example, I want to find any genes that fall within 100KB up/down-stream of position chr1:925881.

CHROM    POS        gene
1      925881       RP11
1      925881       RP11
1      930248       SAMD11
1      930248       SAMD11
1      943136       NOC2L
1      943136       NOC2L

Could someone please suggest to me the solution for this? My preferred programming language is R.

R • 2.1k views
ADD COMMENT
0
Entering edit mode

100kb is a pretty wide window. Why 100kb?

ADD REPLY
0
Entering edit mode

I am trying to find nearby genes.

ADD REPLY
0
Entering edit mode

100kb is really not nearby. That's why I ask, why that number?

Sequence Ontology define an upstream/downstream variant of a gene as within 5kb

ADD REPLY
0
Entering edit mode

I am performing rare variant analysis and want to find any loci associated within 100KB. This paper has additional information: https://alz-journals.onlinelibrary.wiley.com/doi/10.1002/alz.12319

ADD REPLY
0
Entering edit mode

I just think that with 3 million variants and 100kb around each, you're going to have more data than is ever going to be useful. You'll probably get every single gene several times over.

ADD REPLY
0
Entering edit mode

Actually, the variants I am interested in are about 100 only.

ADD REPLY
0
Entering edit mode

bedtools closest would be a good start

ADD REPLY
2
Entering edit mode
3.4 years ago
Emily 24k

I suggest running your variants through the VEP. You will get detailed information about the effects on genes. The default setting for up/downstream is 5kb but you can alter that to 100kb. If you're looking for variants associated with common disease, you might want to consider using the postgap plugin as well which will find variants in LD with your input variants and look all of them up in GTEx, which may give you more useful information about how these variants may affect gene expression.

ADD COMMENT
0
Entering edit mode

Well, I wanted to find the ovarlapping variants from the paper I mentioned before.

ADD REPLY
0
Entering edit mode
3.4 years ago

You can use BEDOPS to solve this problem.

First, reformat your variants into a sorted BED file, which uses UCSC chromosome names:

$ tail -n+2 variants.txt | awk -v FS="\t" -v OFS="\t" '{ print "chr"$1, ($2 - 1), $2, $3 }' | sort-bed - > variants.bed

Second, get gene annotations, e.g. from Gencode, assuming human:

$ GENCODE_URL=ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gff3.gz
$ wget -qO- "${GENCODE_URL}" | gunzip -c | awk '($3=="gene")' | gff2bed - | awk -v FS="\t" -v OFS="\t" '{ n=split($10,a,";"); p=a[4]; n=split(p,a,"="); $4=a[2]; print $0; }' | cut -f1-6 > genes.bed

Now you have variants and genes in sorted BED format, both using UCSC chromosome names.

If you want all genes within 100kb, you can use bedmap --range to associate the variants with genes overlapping within that window:

$ bedmap --echo --echo-map-id-uniq --range 100000 --delim '\t' variants.bed genes.bed > answer.bed

The file answer.bed will contain the variant position and the unique IDs of all genes overlapping a 200kb window, centered on the variant position — i.e., 100kb upstream and downstream.

The --echo-map-id-uniq option can be replaced with --echo-map, if you need the intervals or other metadata for overlapping genes; see the documentation for more detail.

This file can be read into R via read.table or similar, if needed.

ADD COMMENT

Login before adding your answer.

Traffic: 2603 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6