Question

Identify gene symbols given a list of chromosome positions

0

Entering edit mode

10.0 years ago

cw00137 ▴ 10

I have downloaded ChIP-Seq data and managed to get to a point where I have a long list of chromosome positions and some expression data. My question is, how to map these chromosome locations to HUGO gene symbols?

An example of my data is:

Peak                     GSM365925_ER_minus_ligand_align.bed     GSM365926_ER_E2_align.bed
chr20:257411-257873|     7                                       49
chr20:363265-363667|     0                                       98
chr20:373762-374404|     3                                       170
chr20:549324-550256|     1                                       23

I would preferably like a solution in python, though I could stretch to one in perl and a solution with a GUI interface would also be welcome as I'm still a relatively inexperienced programmer. Thanks

ChIP-Seq • 15k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by cw00137 ▴ 10

Ram · Answer 1 · 2014-11-19

Using python and cruzdb:

from cruzdb import Genome

hg19 = Genome('hg19')

for i, line in enumerate(open('dat.txt')):
    toks = line.split()
    if I == 0:
        print "\t".join(['gene'] + toks)
    else:
        chrom, posns = toks[0].split(":")
        start, end = map(int, posns.rstrip("|").split("-"))
        genes = hg19.bin_query('refGene', chrom, start, end)
        print "\t".join(["|".join(set(g.name2 for g in genes))] + toks)

Ram · Answer 2 · 2014-11-19

You could try Biomart

There's a query engine under Tools > Browse data, use the Region filter field to fill in multiple regions

For example; chr20:257411-257873 would be (you might want to check both strands)

20:257411:257873:1
20:257411:257873:0

Under output section select HGNC Symbol from External references plus any others you want.

Or if you want to do this programmatically there's an example script here of a BioMart interface in Ruby which could be altered to return gene symbols by genomic position.

Ram · Answer 3 · 2014-11-19

1

Entering edit mode

10.0 years ago

Alex Reynolds 36k

First, grab RefSeq names via UCSC:

Visit: http://genome.ucsc.edu/cgi-bin/hgTables?command=start
Pick your assembly from the Assembly pull-down menu
Choose RefSeq Genes from the Track pull-down menu
Pick BED - browser extensible data from the Output Format pull-down menu
Click on Get Output to download data in BED format (say, a file called "refseq.bed")

Second, convert your peaks to a sorted BED file:

$ tail -n+1 peaks.txt | cut -f1 -d '|' - | tr -s ':' '-' | sed s/-/\\t/g | sort-bed - > peaks.bed

Then do an efficient set operation with BEDOPS:

$ bedmap --echo --echo-map-id-uniq peaks.bed refseq.bed > peaks_with_refseq_annotations.bed

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Hi Alex - I am assuming that I need to upload my data using the 'Add Custom Tracks' thing? ... Also, the files I'm analyzing were downloaded and are already in .BED files. For each condition I have one 'peaks' file which I'm assuming is a positive control for analysis and a ' align' file. Are you suggesting I should sort them into another .BED file. Thanks.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by cw00137 ▴ 10

0

Entering edit mode

No, I would not upload data, but download RefSeq gene names to a local BED file, and then do the mapping between peaks and RefSeq gene names with BEDOPS and standard UNIX command line tools.

If your data are already in BED format, you can just do something like: $ sort-bed peaks.bed > sorted_peaks.bed to ensure that the data are sorted. BEDOPS tools were written to take advantage of sorted BED data.

I know you said that you are not a programmer, but learning the command line is important enough that I figured I would suggest that approach with some example commands that give you the answer you want. I hope this answer gives you a taste of what you can do.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Hi alex! I'm wanting to do something similar but I only have the start site position in my sorted BED file e.g.

chr1 9061386
chr1 9061390
...

it has about ~20 lines, unfortunately my system doesn't have the most up to date version of gcc req'd for BEDOPS installation... is there another way of doing this at the command line without using BEDOPS? (yes I could install more up to date versions of gcc and g++ but that's currently out of my hands)

Look forward to your response!

ADD REPLY • link 6.2 years ago by s1667153 • 0

Ram · Answer 4 · 2014-11-19

GALAXY is your friend here. If you are going to be working with data based on genome coordinates then it will be you friend for "life" :)

You can upload you ChIP-seq regions, then (following Alex Reynolds instructions steps 1 to 4) upload the BED file of refseq genes. You can overlap the genome coordinates using the intersect tool. You may want to extend the refseq coordinates beyond the boundaries of the transcripts to increase the range of the possible overlap.

Ram · Answer 5 · 2014-11-19

0

Entering edit mode

10.0 years ago

Ming Tommy Tang ★ 4.5k

Easiest way: PAVIS and http://cru.genomics.iit.it/AnnotateGenomicRegions/index.jsf

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Ming Tommy Tang ★ 4.5k

score 0 · Answer 6 · 2017-03-24

0

Entering edit mode

7.7 years ago

Tommy Carstensen ▴ 210

Here is yet another Python solution using pyensembl:

import pyensembl
ensembl = pyensembl.EnsemblRelease(release=75)
gene_names = ensembl.gene_names_at_locus(contig=chrom, position=pos)

ADD COMMENT • link 7.7 years ago by Tommy Carstensen ▴ 210