Question

UCSC different exon sets for each gene

5.6 years ago

cocchi.e89 ▴ 290

I'm trying to collect the exon start-end coordinates for a set of genes. I downloaded the UCSC tables, but I found that several different exon sets are output for each gene. For example, the UMOD gene:

#hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.txStart  hg19.knownGene.txEnd    hg19.knownGene.exonCount    hg19.knownGene.exonStarts   hg19.knownGene.exonEnds hg19.kgXref.geneSymbol
uc002dgz.3  chr16   20344372    20364037    11  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010, 20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362161,20364037, UMOD
uc002dha.3  chr16   20344372    20364037    11  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010, 20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362098,20364037, UMOD
uc002dhb.3  chr16   20344372    20364037    12  20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361092,20361971,20364010,    20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20361191,20362161,20364037,    UMOD

so my question is: which one am I supposed to rely on? And how do you select it?

Thanks a lot in advance for any help!

ucsc exomes gene coordinates • 1.6k views

ADD COMMENT • link updated 5.6 years ago by Luis Nassar ▴ 670 • written 5.6 years ago by cocchi.e89 ▴ 290

score 0 · Answer 1 · 2019-05-24

5.6 years ago

Pierre Lindenbaum 164k

knownGene is a misleading name. What you're seeing here are three transcripts (uc002dgz.3, uc002dha.3, uc002dhb.3) for the same gene, UMOD.

which one am I supposed to rely on?

There is no quick answer to this; it depends on your needs: the largest transcript, the most covered, etc., or just use the min/max coordinates.
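
If "largest" is taken to mean the greatest total exon length, it can be computed directly from the exonStarts and exonEnds columns. Below is a minimal Python sketch using the three UMOD rows from the question. The helper names (`parse_coords`, `exonic_length`) and the choice of summed exon length as the criterion are assumptions made for illustration, not a UCSC convention.

```python
# Three UMOD transcripts from the question: (exonStarts, exonEnds) as UCSC prints them.
transcripts = {
    "uc002dgz.3": (
        "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010,",
        "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362161,20364037,",
    ),
    "uc002dha.3": (
        "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361971,20364010,",
        "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20362098,20364037,",
    ),
    "uc002dhb.3": (
        "20344372,20346803,20347967,20348612,20352412,20355345,20357447,20359544,20359757,20361092,20361971,20364010,",
        "20344697,20346842,20348049,20348775,20352658,20355494,20357656,20359652,20360534,20361191,20362161,20364037,",
    ),
}


def parse_coords(field):
    """Turn a comma-separated UCSC list (with trailing comma) into ints."""
    return [int(x) for x in field.split(",") if x]


def exonic_length(starts, ends):
    """Sum of exon lengths; UCSC starts are 0-based and ends are exclusive."""
    return sum(e - s for s, e in zip(parse_coords(starts), parse_coords(ends)))


# Option 1: pick the transcript with the greatest summed exon length.
lengths = {name: exonic_length(s, e) for name, (s, e) in transcripts.items()}
longest = max(lengths, key=lengths.get)

# Option 2: ignore isoforms and take the overall min/max span.
gene_start = min(min(parse_coords(s)) for s, _ in transcripts.values())
gene_end = max(max(parse_coords(e)) for _, e in transcripts.values())

print(lengths)
print(longest, gene_start, gene_end)
```

Ties between transcripts of equal length are resolved by dictionary order here, so a real script may want an explicit tie-break rule.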

ADD COMMENT • link 5.6 years ago by Pierre Lindenbaum 164k

Probably the largest. Should I calculate it for each one, or is there some sort of "indicator" of the largest set?

ADD REPLY • link 5.6 years ago by cocchi.e89 ▴ 290

score 0 · Answer 2 · 2019-05-24

5.6 years ago

Luis Nassar ▴ 670

As Pierre mentioned, knownGene includes a large set of transcripts; in total it has 82,960 items. If you are just looking for one representative transcript per gene, I would recommend using the knownCanonical table instead. This table is a subset of knownGene, generally containing the longest isoform. You may search our forums for more details on how it is generated (https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome).

Here is a link to the .txt.gz knownCanonical data: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownCanonical.txt.gz

You may also get your query from the Table Browser, using the following link (http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_doMainPage=1&hgta_group=genes&hgta_track=knownGene&hgta_table=knownGene) and following these steps:

change the table to knownCanonical then change the output to selected fields from primary and related tables
add a file name to download the file
get output
select chrom, chromStart, chromEnd, protein, and geneSymbol
get output

The results should look as follows:

#hg19.knownCanonical.chrom  hg19.knownCanonical.chromStart  hg19.knownCanonical.chromEnd    hg19.knownCanonical.protein hg19.kgXref.geneSymbol
chr1    11873   14409   uc010nxq.1  DDX11L1
chr1    14361   19759   uc009viu.3  WASH7P
chr1    14406   29370   uc009viw.2  WASH7P
chr1    34610   36081   uc001aak.3  FAM138F
...

Lou UCSC GB

ADD COMMENT • link 5.6 years ago by Luis Nassar ▴ 670

Thank you very much for the suggestions, Luis, but I need the start-end of each exon, not of the overall gene.

ADD REPLY • link 5.6 years ago by cocchi.e89 ▴ 290

Ah, in that case you can change the output to BED then get output and in the following page you will see:

Create one BED record per:

If you select Coding Exons (or Exons plus 0 if you want to include UTR regions), you should get an output like this, with one entry for each exon:

chr1    12189   12227   uc010nxq.1_cds_0_0_chr1_12190_f 0   +
chr1    12594   12721   uc010nxq.1_cds_1_0_chr1_12595_f 0   +
chr1    13402   13639   uc010nxq.1_cds_2_0_chr1_13403_f 0   +
chr1    69090   70008   uc001aal.1_cds_0_0_chr1_69091_f 0   +
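
The name column in this BED output carries the transcript ID and the exon index, so both can be recovered with a little parsing and then joined to a gene symbol. The following is a rough Python sketch. The name layout is inferred from the example lines in this thread rather than from a formal specification, and `parse_bed_line` and `symbol_by_transcript` are hypothetical names (the lookup table would have to be built separately, for example from a kgXref download).

```python
import re

# Pattern inferred from names such as "uc010nxq.1_cds_0_0_chr1_12190_f":
# <transcript>_<cds|exon>_<index>_...  (an assumption based on the examples shown).
NAME_RE = re.compile(r"^(.+?)_(cds|exon)_(\d+)_")

# Hypothetical lookup table; only the pair shown in this thread is included.
symbol_by_transcript = {"uc010nxq.1": "DDX11L1"}


def parse_bed_line(line):
    """Split one 6-column BED line and pull the transcript ID out of the name."""
    chrom, start, end, name, score, strand = line.split()[:6]
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected BED name: {name}")
    transcript_id = match.group(1)
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "transcript": transcript_id,
        "exon_index": int(match.group(3)),
        "strand": strand,
        # Fall back to "NA" when the transcript is not in the lookup table.
        "symbol": symbol_by_transcript.get(transcript_id, "NA"),
    }


record = parse_bed_line("chr1\t12189\t12227\tuc010nxq.1_cds_0_0_chr1_12190_f\t0\t+")
print(record)
```

The non-greedy match also copes with RefSeq-style IDs that contain an underscore, such as NM_001308203.1.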

ADD REPLY • link 5.6 years ago by Luis Nassar ▴ 670

If I leave "UCSC Genes" in the query page, it doesn't allow me to select "Coding Exons" in the BED page, but if I change it to "NCBI RefSeq" it does, and I get:

chr1    67000041    67000051    NM_001308203.1_cds_1_0_chr1_67000042_f  0   +
chr1    67091529    67091593    NM_001308203.1_cds_2_0_chr1_67091530_f  0   +
chr1    67098752    67098777    NM_001308203.1_cds_3_0_chr1_67098753_f  0   +

But how can I then add the gene symbol here? Or retrieve it from somewhere else...

ADD REPLY • link 5.6 years ago by cocchi.e89 ▴ 290

Found a good answer finally at: How To Get Bed File Containing Exons Of Canonical Transcripts And Their Corresponding Gene Symbols

ADD REPLY • link 5.6 years ago by cocchi.e89 ▴ 290

Oh, I see. I believe I understand: you're looking for the individual exon start/stop coordinates for one isoform per gene?

You're right, the knownCanonical table does not have exon start/stop coordinates. The following should work for you though:

Choose the knownCanonical table as described above, give a file name to download, then Select fields from primary....
Choose only protein then get output to download the file

This gives you a file with each canonical transcript ID, e.g.:

#protein
uc010nxq.1
uc009viu.3
uc009viw.2

Go back to the table browser and switch to the knownGene table
For identifiers (names/accessions): choose upload list and select the file you just created

You should see a message like Note: 1 of the 31849 failed to upload. The one that failed is the '#protein' file header. At this point you are restricting the knownGene data set to just one isoform.
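
For anyone who prefers to do the same restriction offline on downloaded files, here is a small Python sketch of the idea. It assumes the ID list has one ID per line with a '#' header and that the transcript ID is the first tab-separated column of each knownGene row. `load_ids` and `keep_listed` are hypothetical helper names.

```python
def load_ids(lines):
    """Collect transcript IDs, skipping blank lines and '#' header lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


def keep_listed(rows, ids):
    """Keep only rows whose first tab-separated column is a listed ID."""
    return [row for row in rows if row.split("\t")[0] in ids]


# In-memory stand-ins for the two files, using rows shown in this thread.
id_file = ["#protein", "uc010nxq.1", "uc009viu.3", "uc009viw.2"]
known_gene_rows = [
    "\t".join(["uc010nxq.1", "chr1", "+", "3", "11873,12594,13402,", "12227,12721,14409,", "DDX11L1"]),
    "\t".join(["uc001aak.3", "chr1", "-", "3", "34610,35276,35720,", "35174,35481,36081,", "FAM138F"]),
]

ids = load_ids(id_file)
kept = keep_listed(known_gene_rows, ids)
print(len(ids), len(kept))
```

Skipping the header explicitly avoids the "failed to upload" note, since '#protein' is never treated as an identifier.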

Add a file name to download, then Select fields from primary.... get output
Choose the fields you want, e.g. name, chrom, strand, exonCount, exonStarts, exonEnds, geneSymbol, then get output

This output should give you a list of 31849 canonical isoforms with individual exon start/stop sites, e.g.

#hg19.knownGene.name    hg19.knownGene.chrom    hg19.knownGene.strand   hg19.knownGene.exonCount    hg19.knownGene.exonStarts   hg19.knownGene.exonEnds hg19.kgXref.geneSymbol
uc010nxq.1  chr1    +   3   11873,12594,13402,  12227,12721,14409,  DDX11L1
uc009viu.3  chr1    -   10  14361,14969,15795,16606,16857,17232,17914,18267,18500,18912,    14829,15038,15947,16765,17055,17742,18061,18369,18554,19759,    WASH7P
uc009viw.2  chr1    -   7   14406,16857,17232,17914,18267,24737,29320,  16765,17055,17742,18061,18366,24891,29370,  WASH7P
uc001aak.3  chr1    -   3   34610,35276,35720,  35174,35481,36081,  FAM138F
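
If one line per exon is needed rather than comma-separated lists, the rows above can be expanded into BED-like records that carry the gene symbol. Here is a short Python sketch, assuming the seven tab-separated columns are in the order shown in the header. The function name `row_to_bed` and the `symbol|transcript|exonN` naming scheme are arbitrary choices for illustration.

```python
def row_to_bed(row):
    """Yield one 6-column BED line per exon from a knownGene output row."""
    name, chrom, strand, exon_count, starts, ends, symbol = row.rstrip("\n").split("\t")
    # UCSC lists end with a trailing comma, so drop empty pieces.
    start_list = [int(x) for x in starts.split(",") if x]
    end_list = [int(x) for x in ends.split(",") if x]
    if not (len(start_list) == len(end_list) == int(exon_count)):
        raise ValueError(f"exon count mismatch for {name}")
    # Exons are numbered in genomic order, not transcript order,
    # so on the minus strand exon0 is the last exon of the transcript.
    for i, (start, end) in enumerate(zip(start_list, end_list)):
        yield f"{chrom}\t{start}\t{end}\t{symbol}|{name}|exon{i}\t0\t{strand}"


example = "\t".join(
    ["uc010nxq.1", "chr1", "+", "3", "11873,12594,13402,", "12227,12721,14409,", "DDX11L1"]
)
bed_lines = list(row_to_bed(example))
for bed_line in bed_lines:
    print(bed_line)
```

Header lines starting with '#' should be skipped before calling the function on a real file.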

ADD REPLY • link 5.6 years ago by Luis Nassar ▴ 670