One idea is to:
- Map genomic positions ("coordinates") to gene name annotations using set operation tools like BEDOPS
bedmap
- Convert gene name annotations to a format that can be used with gene ontology ("GO") tools, such as EMBL format, in order to work with EMBL's QuickGO tool
- Run those converted gene names through the GO tool-of-choice to get functional annotations
As a more concrete example of this workflow, you could convert a sort-bed
sorted BED file of coordinates of interest (intervals.bed
) to a text file of HGNC symbols/names with BEDOPS convert2bed
and bedmap
.
This uses Gencode annotations that are hg38; make adjustments depending on your genome:
$ sort-bed intervals.unsorted.bed > intervals.bed
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' \
| convert2bed -i gff - \
> gencodeV21.genes.bed
$ paste <(cut -f1-3 gencodeV21.genes.bed) <(grep -Po 'gene_name=\K[^ ;]+' gencodeV21.genes.bed) > gencodeV21.HGNCgenes.bed
$ bedmap --echo-map-id --multidelim '\n' --skip-unmapped intervals.bed gencodeV21.HGNCgenes.bed > genes.txt
(Note: If you're on OS X, use Homebrew to install GNU grep
and replace grep
with ggrep
.)
If you have a text file of HGNC symbols/names on each line, you can convert them to EMBL names with the following Python script:
#!/usr/bin/env python
import sys
import mygene
mg = mygene.MyGeneInfo()
genes = []
for line in sys.stdin:
genes.append(line.strip())
results = mg.querymany(genes, scopes="symbol", fields=["uniprot"], species="human", verbose=False)
for res in results:
name = res["query"]
if "uniprot" in res and "TrEMBL" in res["uniprot"]:
recs = res["uniprot"]["TrEMBL"]
if isinstance(recs, basestring):
sys.stdout.write("%s\t%s\n" % (name, recs))
else:
for rec in recs:
sys.stdout.write("%s\t%s\n" % (name, rec))
For example, if you start with a text file called genes.txt
:
DDX26B
CCDC83
MAST3
RPL11
ZDHHC20
LUC7L3
SNORD49A
CTSH
ACOT8
Then you can run the script on the genes to map them to EMBL equivalents:
$ ./hgnc_to_embl.py < genes.txt > genes_mapping.txt
The resulting file would look something like:
CCDC83 H0YDV3
MAST3 V9GYV0
RPL11 Q5VVC8
RPL11 Q5VVC9
RPL11 Q5VVD0
ZDHHC20 A0A0D9SEN4
ZDHHC20 B4DRN8
LUC7L3 A8K3C5
LUC7L3 C9JL41
LUC7L3 D6RDI2
LUC7L3 D6RHH0
LUC7L3 E7EN55
LUC7L3 H0YA81
LUC7L3 H0YAR4
LUC7L3 H0YAX1
LUC7L3 H0YAY6
LUC7L3 H0YBV7
LUC7L3 H7C5U7
LUC7L3 J3KPP4
LUC7L3 Q86Y74
LUC7L3 U3KQT3
CTSH A0A087X0D5
CTSH A0A0B4J217
CTSH E9PKT6
CTSH E9PN60
CTSH E9PN84
ACOT8 E9PIS4
ACOT8 E9PJN0
ACOT8 E9PMC4
ACOT8 E9PRD4
ACOT8 F6VBM3
ACOT8 H0Y698
ACOT8 H7C5A7
ACOT8 Q9BR14
(Note that some HGNC names do not map to EMBL names. Consider how this issue might affect any statistical tests you do with GO annotations.)
To get a text file containing just the EMBL names:
$ cut -f2 genes_mapping.txt > embl.txt
Once you have those EMBL names, you can submit them to QuickGO and export the results:
- Visit: http://www.ebi.ac.uk/QuickGO/
- Click on the "View GO Annotations" panel
- Click on the "Gene Product ID" button in the toolbar
- Copy and paste the names from
embl.txt
into the text field (or paste in your result set from running things locally with your genes of interest)
- Click the "Add" button
- Click the "Apply" button to perform the search
- Click the "Export" button to export the search results