Identify gene symbols given a list of chromosome positions
6
0
Entering edit mode
10.1 years ago
cw00137 ▴ 10

I have downloaded ChIP-Seq data and managed to get to a point where I have a long list of chromosome positions and some expression data. My question is, how to map these chromosome locations to HUGO gene symbols?

An example of my data is:

Peak                     GSM365925_ER_minus_ligand_align.bed     GSM365926_ER_E2_align.bed
chr20:257411-257873|     7                                       49
chr20:363265-363667|     0                                       98
chr20:373762-374404|     3                                       170
chr20:549324-550256|     1                                       23

I would preferably like a solution in python, though I could stretch to one in perl and a solution with a GUI interface would also be welcome as I'm still a relatively inexperienced programmer. Thanks

ChIP-Seq • 15k views
ADD COMMENT
2
Entering edit mode
10.1 years ago
brentp 24k

Using python and cruzdb:

from cruzdb import Genome

hg19 = Genome('hg19')

for i, line in enumerate(open('dat.txt')):
    toks = line.split()
    if I == 0:
        print "\t".join(['gene'] + toks)
    else:
        chrom, posns = toks[0].split(":")
        start, end = map(int, posns.rstrip("|").split("-"))
        genes = hg19.bin_query('refGene', chrom, start, end)
        print "\t".join(["|".join(set(g.name2 for g in genes))] + toks)
ADD COMMENT
1
Entering edit mode
10.1 years ago
Garan ▴ 690

You could try Biomart

There's a query engine under Tools > Browse data, use the Region filter field to fill in multiple regions

For example; chr20:257411-257873 would be (you might want to check both strands)

20:257411:257873:1
20:257411:257873:0

Under output section select HGNC Symbol from External references plus any others you want.

Or if you want to do this programmatically there's an example script here of a BioMart interface in Ruby which could be altered to return gene symbols by genomic position.

ADD COMMENT
1
Entering edit mode
10.1 years ago

First, grab RefSeq names via UCSC:

  1. Visit: http://genome.ucsc.edu/cgi-bin/hgTables?command=start
  2. Pick your assembly from the Assembly pull-down menu
  3. Choose RefSeq Genes from the Track pull-down menu
  4. Pick BED - browser extensible data from the Output Format pull-down menu
  5. Click on Get Output to download data in BED format (say, a file called "refseq.bed")

Second, convert your peaks to a sorted BED file:

$ tail -n+1 peaks.txt | cut -f1 -d '|' - | tr -s ':' '-' | sed s/-/\\t/g | sort-bed - > peaks.bed

Then do an efficient set operation with BEDOPS:

$ bedmap --echo --echo-map-id-uniq peaks.bed refseq.bed > peaks_with_refseq_annotations.bed
ADD COMMENT
0
Entering edit mode

Hi Alex - I am assuming that I need to upload my data using the 'Add Custom Tracks' thing? ... Also, the files I'm analyzing were downloaded and are already in .BED files. For each condition I have one 'peaks' file which I'm assuming is a positive control for analysis and a ' align' file. Are you suggesting I should sort them into another .BED file. Thanks.

ADD REPLY
0
Entering edit mode

No, I would not upload data, but download RefSeq gene names to a local BED file, and then do the mapping between peaks and RefSeq gene names with BEDOPS and standard UNIX command line tools.

If your data are already in BED format, you can just do something like: $ sort-bed peaks.bed > sorted_peaks.bed to ensure that the data are sorted. BEDOPS tools were written to take advantage of sorted BED data.

I know you said that you are not a programmer, but learning the command line is important enough that I figured I would suggest that approach with some example commands that give you the answer you want. I hope this answer gives you a taste of what you can do.

ADD REPLY
0
Entering edit mode

Hi alex! I'm wanting to do something similar but I only have the start site position in my sorted BED file e.g.

chr1 9061386
chr1 9061390
...

it has about ~20 lines, unfortunately my system doesn't have the most up to date version of gcc req'd for BEDOPS installation... is there another way of doing this at the command line without using BEDOPS? (yes I could install more up to date versions of gcc and g++ but that's currently out of my hands)

Look forward to your response!

ADD REPLY
0
Entering edit mode
10.1 years ago
Ian 6.1k

GALAXY is your friend here. If you are going to be working with data based on genome coordinates then it will be you friend for "life" :)

You can upload you ChIP-seq regions, then (following Alex Reynolds instructions steps 1 to 4) upload the BED file of refseq genes. You can overlap the genome coordinates using the intersect tool. You may want to extend the refseq coordinates beyond the boundaries of the transcripts to increase the range of the possible overlap.

ADD COMMENT
0
Entering edit mode
ADD COMMENT
0
Entering edit mode
7.8 years ago

Here is yet another Python solution using pyensembl:

import pyensembl
ensembl = pyensembl.EnsemblRelease(release=75)
gene_names = ensembl.gene_names_at_locus(contig=chrom, position=pos)
ADD COMMENT

Login before adding your answer.

Traffic: 3640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6