Alternatives to UCSC genome browser for obtaining gene coordinates
3
1
Entering edit mode
3 days ago
shpak.max ▴ 60

If I have a long list of HUGO gene names for which I need to obtain the hg19 coordinates, what are some of the alternatives to using the UCSC Table Browser? The latter is awkward because for each gene, it returns a large number of entries corresponding to alternative exon splicing, so even if I just request chromosome and cds start + stop, I get a large (and varied) number of entries per gene which makes the output difficult to parse.

Basically, I just need a site where I can submit my gene list and get chr, start, stop for each gene, with one output line per input line. Are there some other options?

browser ucsc • 338 views
ADD COMMENT
3
Entering edit mode
1 day ago

Some people call what you want the "canonical" or "best" transcript. It's all documented here in UCSC's FAQ: https://genome.ucsc.edu/FAQ/FAQgenes.html

Especially the question "how can I download a single transcript per gene": https://genome.ucsc.edu/FAQ/FAQgenes.html#singledownload

In short, you have many options. Personally, I like the RefSeq subset "HGMD" for clinical applications and "RefSeq Select" for anything else. You can get these either from the downloads server or by clicking the "Data format and download button" directly from the track documentation. Also, the gene tracks all have a link to the FAQ.

More concretely, for the RefSeq select tracks:

curl -s https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/ncbiRefSeqSelect.txt.gz | zcat | cut -f3,5,6,13 | head -3

chr1    65418   71585   OR4F5
chr1    367658  368597  OR4F29
chr1    621095  622034  OR4F16

The latter is awkward because for each gene, it returns a large number of entries corresponding to alternative exon splicing

These entries are called transcripts and they are essential in genomics. A gene really is a list of transcripts. It would be good to know more what you're trying to do, because often it's a bad idea to reduce genes to just the canonical transcript. Almost always you can now select transcripts but instead work with the full list and get more complete results.

Incidentally, among other issues, UCSC Table browser feature seems to be buggy - for about 1/5 of my genes, it returns the same value for the start/stop position

This are probably non-coding genes and the positions are not start-stop but coding start-stop and the fact that these are the same means that there is no coding sequence. It's also explained in the FAQ: https://genome.ucsc.edu/FAQ/FAQgenes.html#coding which says:

If using the UCSC knownGene table, one can filter for where the coding start and coding end fields of the table are equivalent, e.g. knownGene.cdsStart = knownGene.cdsEnd, which would ensure the selected entries are non-coding genes.

ADD COMMENT
2
Entering edit mode
2 days ago
GenoMax 150k

Get the basic gene annotation from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh37_mapping/gencode.v47lift37.basic.annotation.gtf.gz

Use

  $  zcat gencode.v47lift37.basic.annotation.gtf.gz | awk '$3 == "gene" { 
        # Extract gene_name from the last column using regular expression
        match($0, /gene_name "([^"]+)"/, arr); 
        if (arr[1] != "") print $1, $4, $5, arr[1] 
    }'

to generate a table of gene start and stops

chr1 10370 13118 DDX11L2
chr1 11121 24894 DDX11L16
chr1 12010 13670 DDX11L1
chr1 14356 30744 WASH7P
chr1 14696 24886 WASH7P
chr1 28589 31109 MIR1302-2HG
chr1 34553 37595 FAM138A
chr1 36526 40778 ENSG00000308361
chr1 51891 64116 ENSG00000290826
chr1 52473 53312 OR4G4P
chr1 62949 63887 OR4G11P
chr1 65419 71585 OR4F5
chr1 89551 91105 ENSG00000239945
chr1 96638 105742 ENSG00000308314
chr1 131025 134836 CICP27
chr1 134901 139379 AL627309.1
chr1 134914 136246 ENSG00000308579

Extract what you need from here with grep -wf and list of your genes in a file (one per line).

ADD COMMENT
0
Entering edit mode

Thank you for the reference, I'll look into it.

Incidentally, among other issues, UCSC Table browser feature seems to be buggy - for about 1/5 of my genes, it returns the same value for the start/stop position (usually duplicating the cds stop position), while for the others the start-stop more or less matches what one would get by using their web browser.

ADD REPLY
2
Entering edit mode
20 hours ago
LauferVA 4.6k

shpak.max ,

First, bravo for specifying both a reference build (hg19) and a standardized gene identifier system (HUGO gene names).

As GenoMax and Maximilian Haeussler have pointed out in their excellent answers you can select a “canonical” or “best” transcript (such as using the RefSeq Select tracks). This is a solid, practical answer and remains state of the art in many areas. However, ultimately, there is more than one phenotypically relevant transcript per gene for a LOT of genes. One good resource for this is MANE plus Clinical.

I wanted to make the same recommendation, but also wanted to give you an example of a case in which getting the right answer actually depends on correct transcript isoform selection.

Example: A missed diagnosis of Kleefstra Syndrome in a 2 year old girl.

Kleefstra syndrome is a neurodevelopmental disorder characterized by intellectual disability, hypotonia, distinctive facial features, and often congenital heart defects. It usually results from haploinsufficiency EHMT1 a gene that mediates chromatin modification and transcriptional regulation. However, loss of function mutations in KMT2C can also lead to a Kleefstra‐like phenotype ...

Now, consider this article. In the report, the analysis of the microarray data uses the basic gene annotation of KMT2C.

Deletions in the proximal exons of KMT2C are frequently seen in healthy individuals and are generally considered benign. However, this patient had a deletion in a more distal region that specifically affects an isoform of KMT2C made in the brain and thats required for normal neural development. But, because this analysis didn't distinguish between the various isoforms of KMT2C, the deletion was overlooked and not flagged as pathogenic. As a result the little girl was not diagnosed with Kleefstra-like syndrome :-(

This same article provides two additional examples.

Pull normal and brain-specific transcript isoforms (and save the little girl)!!

The first command below extracts the basic gene-level coordinates from the Gencode basic annotation first, then the transcript-level coordinates for a specific isoform from the full annotation file. (here we use ENST00000496432.1 as an illustrative brain-expressed transcript; these are found by consulting biomedical literature).

Download the annotation files for GRCh37 (hg19)

curl -O https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh37_mapping/gencode.v47lift37.basic.annotation.gtf.gz
curl -O https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh37_mapping/gencode.v47lift37.annotation.gtf.gz

First extract gene-level coordinates for KMT2C from the basic annotation:

echo "Gene-level coordinates for KMT2C (basic annotation):"
zcat gencode.v47lift37.basic.annotation.gtf.gz | \
awk '$3=="gene" && /gene_name "KMT2C"/ { print $1, $4, $5, "KMT2C" }'

Second extract transcript-level coordinates for a brain-expressed isoform.

echo "Transcript-level coordinates for brain-expressed KMT2C isoform (ENST00000496432.1):" 
zcat gencode.v47lift37.annotation.gtf.gz | \
awk '$3=="transcript" && /gene_name "KMT2C"/ && /ENST00000496432.1/ { print $1, $4, $5, "ENST00000496432.1" }'

Now, you can search your variants against the coordinates of both isoforms, and see if anything is awry in either. In practice, folks use many heuristics and databases to do this quickly.

ADD COMMENT

Login before adding your answer.

Traffic: 1705 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6