One might start with GFF-formatted GENCODE annotations:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz | gunzip --stdout - > gencode.v21.gff
Using the feature ontology defined here (see: http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.217), one can segregate GFF annotations by feature type. Feature types include keywords like three_prime_UTR, promoter, etc. We can grab a sorted listing of feature types to automate this process. For example:
$ wget -qO- http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.217 | grep '^name:' | sed 's/name: //' | sort > gff_feature_types.txt
We can then segregate the GENCODE annotations by feature type:
$ while read -r feature_type; do awk -F'\t' -v t="${feature_type}" '$3 == t' gencode.v21.gff > "feature.${feature_type}.gff"; done < gff_feature_types.txt
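The loop above reads the whole GFF file once per feature type. If that is too slow, here is a rough single-pass sketch that writes one file per type actually present in the annotations, matching on the type column (column 3) exactly. The function name split_gff_by_type is made up for illustration, and it assumes a standard nine-column, tab-delimited GFF3 file:

```shell
# Hypothetical helper: split a GFF3 file into feature.<type>.gff files
# in one pass, keyed on the exact value of the type column (column 3).
split_gff_by_type() {
    awk -F'\t' '
        /^#/    { next }                               # skip header and comment lines
        NF >= 9 { print > ("feature." $3 ".gff") }     # one output file per feature type
    ' "$1"
}

# Usage (assumes gencode.v21.gff is in the current directory):
#   split_gff_by_type gencode.v21.gff
```

Because only types that occur in the file get an output file, this variant does not leave empty feature.*.gff files behind.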
Let's assume that you have your variants in a VCF-formatted file called variants.vcf. Let's convert it to BED with vcf2bed:
$ vcf2bed < variants.vcf > variants.bed
For each smaller annotation file that is of non-zero size, we can convert its annotations to BED elements with gff2bed. We then perform set operations against the variants, separating them into per-feature-type categories based on one or more bases of overlap with the annotation subset:
$ for f in feature.*.gff; do [ -s "$f" ] && bedops --element-of 1 variants.bed <(gff2bed < "$f") > "variants.${f%.gff}.bed"; done
Each non-empty file variants.*.bed contains the variants that overlap GENCODE v21 features of one feature type.
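To get a quick overview of the result, something like the following sketch could tabulate how many variants landed in each category. The function name count_variants_per_type is hypothetical; it only assumes the variants.*.bed files sit in the current directory:

```shell
# Hypothetical helper: print "<file><TAB><line count>" for every non-empty
# variants.*.bed file, sorted with the largest category first.
count_variants_per_type() {
    for f in variants.*.bed; do
        [ -s "$f" ] || continue                        # skip empty or missing files
        printf '%s\t%s\n' "$f" "$(wc -l < "$f" | tr -d ' ')"
    done | sort -t "$(printf '\t')" -k2,2nr
}

# Usage:
#   count_variants_per_type
```

Note that a variant can overlap more than one feature type (an exon inside a gene, say), so the counts will not necessarily sum to the number of input variants.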
I have looked into the two databases. It seems that I did not state the problem clearly. The 2000 SNPs I collected are not from an NGS platform or array; they are just ones I gathered from papers and databases. So my file contains only each SNP's rs number, chromosome, position, and alleles, nothing else.
Thus, I want to quickly find out which part of the genome each of them falls in. The first database seems to be what I want; I will check it carefully to confirm.
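A table like that can be turned into BED directly, without going through VCF. Here is a minimal sketch, assuming a tab-delimited file whose columns are rs number, chromosome, 1-based position, and alleles, in that order. The function name snp_table_to_bed and the file name snps.txt are made up for illustration. BED coordinates are 0-based and half-open, so a SNP at position p becomes the interval [p-1, p):

```shell
# Hypothetical helper: convert "rsID<TAB>chr<TAB>pos<TAB>alleles" rows to sorted BED.
# Assumes 1-based positions and adds a "chr" prefix when it is missing,
# since the GENCODE annotations use names like chr1.
snp_table_to_bed() {
    awk -F'\t' -v OFS='\t' '
        $3 ~ /^[0-9]+$/ {                              # skips a header row, if present
            chrom = ($2 ~ /^chr/) ? $2 : "chr" $2
            print chrom, $3 - 1, $3, $1, $4            # chrom, start, end, rsID, alleles
        }
    ' "$1" | LC_ALL=C sort -k1,1 -k2,2n
}

# Usage:
#   snp_table_to_bed snps.txt > variants.bed
```

The positions must be on the same genome assembly as the annotations for the overlaps to mean anything. If BEDOPS is installed, piping through its sort-bed tool instead of sort is the safer way to prepare input for bedops.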