Hi,
Can anyone help me download the exon coordinates of all the genes in the human genome (hg19)?
neeraj
Using mysql:
mysql -u anonymous -h ensembldb.ensembl.org -P 5306 -D homo_sapiens_core_61_37f -A \
  -e 'select S.stable_id, R.name, E.seq_region_start, E.seq_region_end, E.seq_region_strand from exon as E, seq_region as R, exon_stable_id as S where R.seq_region_id = E.seq_region_id and S.exon_id = E.exon_id'
+-----------------+------+------------------+----------------+-------------------+
| stable_id       | name | seq_region_start | seq_region_end | seq_region_strand |
+-----------------+------+------------------+----------------+-------------------+
| ENSE00002029850 | 5    |         94120533 |       94120602 |                -1 |
| ENSE00002069321 | 4    |         17835922 |       17836146 |                 1 |
| ENSE00002048418 | 5    |        123731640 |      123731794 |                -1 |
| ENSE00001815244 | 6    |         13711167 |       13711796 |                -1 |
| ENSE00001363151 | 2    |          1507720 |        1507851 |                 1 |
| ENSE00001737796 | 1    |         40537122 |       40537924 |                 1 |
| ENSE00001800436 | 2    |        165208630 |      165208733 |                -1 |
| ENSE00001255746 | 10   |         93683822 |       93683847 |                 1 |
| ENSE00001844789 | 14   |         74523609 |       74523683 |                 1 |
| ENSE00002137765 | 8    |         17541844 |       17542051 |                -1 |
(...)
Using UCSC & awk:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c |\
awk '{n=int($8); split($9,S,/,/);split($10,E,/,/); for(i=1;i<=n;++i) {printf("%s,%s,%s,%s,%s\n",$1,$2,$3,S[i],E[i]);} }'
uc001aaa.3,chr1,+,11873,12227
uc001aaa.3,chr1,+,12612,12721
uc001aaa.3,chr1,+,13220,14409
uc010nxq.1,chr1,+,11873,12227
uc010nxq.1,chr1,+,12594,12721
uc010nxq.1,chr1,+,13402,14409
uc010nxr.1,chr1,+,11873,12227
uc010nxr.1,chr1,+,12645,12697
uc010nxr.1,chr1,+,13220,14409
uc009vis.2,chr1,-,14362,14829
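One thing to keep in mind when comparing the UCSC output with the Ensembl output: UCSC tables store exon starts 0-based and exon ends exclusive (the BED convention), whereas Ensembl coordinates are 1-based and inclusive. Below is a minimal sketch of the same awk loop that shifts each start by one. `knowngene_to_exons` is a hypothetical helper name, and the sample row is reconstructed from the uc001aaa.3 lines above, with the CDS fields filled in as placeholders.

```shell
# Hypothetical helper: reads knownGene-format rows (tab-separated) on stdin
# and prints one exon per line with 1-based, inclusive coordinates.
knowngene_to_exons() {
  awk -F '\t' '{
    n = int($8)                 # exonCount
    split($9, S, ",")           # exonStarts (0-based)
    split($10, E, ",")          # exonEnds (exclusive, so already 1-based inclusive)
    for (i = 1; i <= n; ++i)
      printf("%s\t%s\t%s\t%d\t%d\n", $1, $2, $3, S[i] + 1, E[i])
  }'
}

# Sample row; the two CDS columns (fields 6 and 7) are placeholders.
printf 'uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\t\tuc001aaa.3\n' |
  knowngene_to_exons
# prints (tab-separated):
# uc001aaa.3  chr1  +  11874  12227
# uc001aaa.3  chr1  +  12613  12721
# uc001aaa.3  chr1  +  13221  14409
```

To process the real table, pipe the `curl ... | gunzip -c` output from the answer above into the function instead of the sample row.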
Several ways of doing this have already been mentioned. The one I like most, for its simplicity, is BioMart: open "martview", choose the latest "Ensembl genes" database and the latest human dataset, then select the attributes you need in the "structures" section (it has an "exon" subsection with "Exon Chr Start (bp)" and "Exon Chr End (bp)"), without applying any filter at all.
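The same BioMart selection can also be scripted. Below is a rough sketch, not a tested recipe. The endpoint URL, the dataset name `hsapiens_gene_ensembl` and the attribute names are my assumptions about the BioMart XML interface, so check them against what martview shows under its "XML" button. Note that hg19 corresponds to GRCh37, so the sketch points at the GRCh37 archive rather than the current release. `biomart_exon_query` and `fetch_biomart_exons` are hypothetical helper names.

```shell
# Assumed GRCh37 (hg19) archive endpoint; verify before relying on it.
MART_URL="http://grch37.ensembl.org/biomart/martservice"

# Hypothetical helper: prints a BioMart XML query asking for exon
# coordinates of every gene, with no filters applied.
biomart_exon_query() {
  cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1" datasetConfigVersion="0.6">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_exon_id"/>
    <Attribute name="chromosome_name"/>
    <Attribute name="exon_chrom_start"/>
    <Attribute name="exon_chrom_end"/>
    <Attribute name="strand"/>
  </Dataset>
</Query>
EOF
}

# Hypothetical helper: sends the query. It needs network access,
# so it is only defined here, not called.
fetch_biomart_exons() {
  curl -s --get --data-urlencode "query=$(biomart_exon_query)" "$MART_URL"
}
```

Running `fetch_biomart_exons > exons.tsv` would then write one tab-separated line per exon, assuming the service accepts the query as written.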
Search BioStar and you will find a number of solutions to this problem. Mostly they use BioMart, as outlined by Jorge, or the UCSC genome browser database tables, as described in the answer pointed to by Pierre.
One point: which set of exon coordinates do you want? There are several, depending on the gene prediction method used - e.g. UCSC, RefSeq or Ensembl transcripts.
If you start from UCSC tables, select "mammal, human, hg19" and then "Genes and Gene Prediction tracks" under "group", you will see the various gene models. Select one of those and choose "describe table schema" to see how exons are stored. Then go back to the main tables page, from where you should be able to download the exon data. This can also be done programmatically or through a SQL query to the UCSC MySQL server.
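For the SQL route, here is a minimal sketch against the public UCSC MySQL server. It assumes the host `genome-mysql.soe.ucsc.edu` and the password-less `genome` user that UCSC documents for public access, and that the chosen table follows the usual gene-prediction layout with `exonStarts` and `exonEnds` columns; check this with "describe table schema" first. `ucsc_exon_sql` and `fetch_ucsc_exons` are hypothetical helper names.

```shell
# Hypothetical helper: builds the SQL for one gene-prediction table.
# $1 is the table name, e.g. refGene, knownGene or ensGene.
ucsc_exon_sql() {
  printf 'SELECT name, chrom, strand, exonStarts, exonEnds FROM %s' "$1"
}

# Hypothetical helper: runs the query on the hg19 database.
# It needs network access and the mysql client, so it is only defined here.
fetch_ucsc_exons() {
  mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -N -D hg19 \
    -e "$(ucsc_exon_sql "$1")"
}

# Show the query that would be sent for the RefSeq gene models.
ucsc_exon_sql refGene; echo
# prints: SELECT name, chrom, strand, exonStarts, exonEnds FROM refGene
```

The `exonStarts` and `exonEnds` columns come back as comma-separated lists, one pair of lists per transcript, so they still need to be split into individual exons, for example with awk's `split`.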
Mostly, although that question asked for a Bioperl solution.
Duplicate of "Getting Genome Coordinates From Refseq Exon Mrna Position Data?"