Question

Exon Coordinates Of Hg19 Genome Download

4

14.1 years ago

Neeraj ▴ 150

Hi,

Can anyone help me download the exon coordinates of all the genes in the human genome (hg19)?

neeraj

exon coordinates hg human • 26k views

ADD COMMENT • link updated 4.1 years ago by ines • 0 • written 14.1 years ago by Neeraj ▴ 150

1

duplicate of Getting Genome Coordinates From Refseq Exon Mrna Position Data?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Pierre Lindenbaum 166k

0

Mostly, although that question asked for a Bioperl solution.

ADD REPLY • link 14.1 years ago by Neilfws 49k

Pierre Lindenbaum · Answer 1 · 2011-03-14

20

14.1 years ago

Pierre Lindenbaum 166k

Using mysql:

mysql -u anonymous -h ensembldb.ensembl.org -P 5306 -D homo_sapiens_core_61_37f -A \
  -e 'select S.stable_id,R.name,E.seq_region_start,E.seq_region_end,E.seq_region_strand from exon as E,seq_region as R,exon_stable_id as S where R.seq_region_id=E.seq_region_id and S.exon_id=E.exon_id'
+-----------------+------+------------------+----------------+-------------------+
| stable_id       | name | seq_region_start | seq_region_end | seq_region_strand |
+-----------------+------+------------------+----------------+-------------------+
| ENSE00002029850 | 5    |         94120533 |       94120602 |                -1 | 
| ENSE00002069321 | 4    |         17835922 |       17836146 |                 1 | 
| ENSE00002048418 | 5    |        123731640 |      123731794 |                -1 | 
| ENSE00001815244 | 6    |         13711167 |       13711796 |                -1 | 
| ENSE00001363151 | 2    |          1507720 |        1507851 |                 1 | 
| ENSE00001737796 | 1    |         40537122 |       40537924 |                 1 | 
| ENSE00001800436 | 2    |        165208630 |      165208733 |                -1 | 
| ENSE00001255746 | 10   |         93683822 |       93683847 |                 1 | 
| ENSE00001844789 | 14   |         74523609 |       74523683 |                 1 | 
| ENSE00002137765 | 8    |         17541844 |       17542051 |                -1 | 
(...)

Using UCSC & awk:

curl  -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c |\
 awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s\n",$1,$2,$3,S[i],E[i]);} }' 
uc001aaa.3,chr1,+,11873,12227
uc001aaa.3,chr1,+,12612,12721
uc001aaa.3,chr1,+,13220,14409
uc010nxq.1,chr1,+,11873,12227
uc010nxq.1,chr1,+,12594,12721
uc010nxq.1,chr1,+,13402,14409
uc010nxr.1,chr1,+,11873,12227
uc010nxr.1,chr1,+,12645,12697
uc010nxr.1,chr1,+,13220,14409
uc009vis.2,chr1,-,14362,14829

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Pierre Lindenbaum 166k

1

split($9,S,/,/) splits column $9 (the comma-separated list of exonStarts) on commas and stores the pieces in the array S. Likewise, split($10,E,/,/) splits column $10 (the comma-separated list of exonEnds) and stores the pieces in the array E.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Pierre Lindenbaum 166k

0

@Pierre can you explain the awk command a little? it's hard for me to follow (but i already upvoted anyway)

ADD REPLY • link 14.1 years ago by brentp 24k

0

and $8 is the number of exons

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Pierre Lindenbaum 166k
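
To make the explanation of the awk command concrete, here is a small sketch that runs the same logic on a single knownGene-style row instead of the full download. The row is reconstructed from the output shown in the answer; the txEnd and cds columns are filled in for illustration only, so treat the sample values as assumptions rather than an exact copy of the file.

```shell
# One tab-separated row in knownGene column order:
# name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds
# (sample values are assumptions reconstructed from the answer's output)
row=$(printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\n')

exons=$(printf '%s\n' "$row" | awk -F'\t' '{
    n = int($8)               # column 8: number of exons
    split($9,  S, ",")        # column 9: exonStarts -> S[1], S[2], ...
    split($10, E, ",")        # column 10: exonEnds  -> E[1], E[2], ...
    for (i = 1; i <= n; ++i)  # print one line per exon
        printf("%s,%s,%s,%s,%s\n", $1, $2, $3, S[i], E[i])
}')

printf '%s\n' "$exons"
```

One thing to keep in mind when comparing with other sources: UCSC tables store start positions 0-based and end positions exclusive, so add 1 to each start if you need 1-based coordinates.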

0

thanks Pierre, that helps.

ADD REPLY • link 14.1 years ago by brentp 24k

0

I have an issue with this: if I want only the exons of the main isoform, how can I extract them? Some exons in this file are repeated across transcripts, for example uc001aaa.3,chr1,+,11873,12227 and uc010nxq.1,chr1,+,11873,12227.

ADD REPLY • link 4.1 years ago by ines • 0
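
Two possible ways to handle the repeated exons, sketched on sample lines taken from the answer's output. The first collapses identical coordinates regardless of transcript; the second keeps only a chosen set of transcript IDs. The ID list below is hypothetical and only for illustration. One plausible source for a real list is the transcript column of UCSC's knownCanonical table for hg19, but that is an assumption, so check the table schema before relying on it.

```shell
# Exon lines in the format produced by the awk command: transcript,chrom,strand,start,end
exons='uc001aaa.3,chr1,+,11873,12227
uc001aaa.3,chr1,+,12612,12721
uc001aaa.3,chr1,+,13220,14409
uc010nxq.1,chr1,+,11873,12227
uc010nxq.1,chr1,+,12594,12721
uc010nxq.1,chr1,+,13402,14409'

# Option 1: drop the transcript ID and keep each distinct chrom,strand,start,end once.
unique_exons=$(printf '%s\n' "$exons" | cut -d, -f2-5 | sort -u)

# Option 2: keep only exons of selected transcripts.
# Hypothetical, space-separated ID list; which isoform is "main" is NOT decided here.
canonical_ids='uc010nxq.1'
canonical_exons=$(printf '%s\n' "$exons" |
    awk -F, -v ids="$canonical_ids" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) keep[a[i]] = 1 }
        ($1 in keep)')

printf 'unique exons:\n%s\n' "$unique_exons"
printf 'exons of selected transcripts:\n%s\n' "$canonical_exons"
```

Option 1 answers "which regions are exonic at all", while option 2 depends entirely on how the list of main isoforms is chosen.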

score 5 · Answer 2 · 2011-03-14

5

14.1 years ago

Jorge Amigo 14k

Several ways of doing this have been mentioned previously. The one I like most for its simplicity is BioMart: select "martview", choose the latest "Ensembl genes" database and the latest human dataset, then select the attributes you need in the "structures" section (it has an "exon" subsection with "Exon Chr Start (bp)" and "Exon Chr End (bp)"), without applying any filter at all.

ADD COMMENT • link 14.1 years ago by Jorge Amigo 14k
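
The same BioMart selection can also be expressed as an XML query and sent to the martservice endpoint, which avoids clicking through martview. Below is a rough sketch that only builds the query. The dataset and attribute names (hsapiens_gene_ensembl, exon_chrom_start, exon_chrom_end and so on) are assumptions based on common BioMart usage, so confirm them with the XML button in martview before relying on them.

```shell
# Build a BioMart XML query for exon coordinates of all human genes, no filters.
# Dataset and attribute names are assumptions; verify them in martview.
query=$(cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1" datasetConfigVersion="0.6">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_exon_id"/>
    <Attribute name="chromosome_name"/>
    <Attribute name="exon_chrom_start"/>
    <Attribute name="exon_chrom_end"/>
    <Attribute name="strand"/>
  </Dataset>
</Query>
EOF
)

printf '%s\n' "$query"
```

To run it, something along the lines of curl -s -G --data-urlencode "query=$query" "http://grch37.ensembl.org/biomart/martservice" > exons.tsv should work. The GRCh37 host is used here because the current main Ensembl site serves a newer assembly than hg19.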

0

Thanx Jorge, it really helps me. Thanx a lot.

ADD REPLY • link 14.1 years ago by Neeraj ▴ 150

0

Why is the result from BioMart different from the result from the Ensembl API?

ADD REPLY • link 12.4 years ago by siyu ▴ 150

Ram · Answer 3 · 2011-03-14

Search BioStar and you will find a number of solutions to this problem. Mostly they use BioMart, as outlined by Jorge, or the UCSC genome browser database tables, as described in the answer pointed to by Pierre.

One point: which set of exon coordinates do you want? There are several, depending on the gene prediction method used - e.g. UCSC, RefSeq or Ensembl transcripts.

If you start from UCSC tables, select "mammal, human, hg19" and then "Genes and Gene Prediction tracks" under "group", you will see the various gene models. Select one of those and choose "describe table schema" to see how exons are stored. Then go back to the main tables page, from where you should be able to download the exon data. This can also be done programmatically or through a SQL query to the UCSC MySQL server.
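
As a rough sketch of the SQL route, the helper below builds a query for any gene-prediction table that stores exons in the usual exonCount / exonStarts / exonEnds columns. The function name exon_query is made up for this example, and the column list assumes the table follows that layout, which is worth confirming with "describe table schema" first.

```shell
# Build the SQL for one gene-prediction table (e.g. refGene, knownGene, ensGene).
# exon_query is a hypothetical helper; the column names assume the table stores
# exons as exonCount plus comma-separated exonStarts / exonEnds lists.
exon_query() {
    printf 'SELECT name, chrom, strand, exonCount, exonStarts, exonEnds FROM %s' "$1"
}

# Show the query that would be sent for the RefSeq gene models.
exon_query refGene
echo
```

Against the public UCSC MySQL server this could then be run as mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e "$(exon_query refGene) LIMIT 5", dropping the LIMIT once the output looks right. The result has one row per transcript, so the exon lists still need to be split as in the awk answer.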