How To Resolve The Gene-To-mRNA Relationships For The Human Genome?
11.4 years ago

I'm interested in analyzing gene and mRNA annotations for the human genome. In addition to the exon/intron structure and CDS for each mRNA, I would like to know which mRNAs correspond to the same gene.

I first tried using genePredToGtf from Jim Kent's tools to download the annotations in GTF format, and then I subsequently tried working directly with the knownGene table data (knownGene.txt.gz). However, I've encountered a couple of issues that make it difficult to resolve gene-mRNA relationships in these data.

  • It seems the gene_id attribute is not unique--that is, some distinct gene loci share the same gene_id value. Consider Q15005--if you run grep Q15005 knownGene.txt, you'll find a gene on chr1 and a different one on chr11 that both use this gene_id. A little digging revealed that the one on chr11 is a legitimate protein-coding gene, and the one on chr1 is a pseudogene, but the knownGene data does not include this information.
  • I'm only interested in protein-coding genes. Many entries in knownGene have CDS start = CDS end, and in the GTF files these simply have no CDS features. I assume I can safely ignore all of these as pseudogenes or other non-protein-coding genes. However, as the previous example showed, some pseudogenes do have a CDS listed. There seems to be no other way to discriminate genes that code for proteins from genes that do not.
  • Some transcripts that appear to belong to the same gene/locus have distinct gene_id values. Most (if not all) of these seem to belong to non-protein-coding genes, however. Since I'm only interested in protein-coding genes, this may not be a problem.
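The first two points can be checked mechanically. Below is a minimal Python sketch, assuming knownGene.txt is tab-separated with no header and uses the column order listed in the comment (please verify against the table schema for your download). It flags transcripts with CDS start = CDS end and reports any protein ID that shows up on more than one chromosome. The function name is made up for illustration, and the check would miss duplicates that sit on the same chromosome.

```python
import csv
from collections import defaultdict

# Assumed knownGene.txt column order (tab-separated, no header line).
FIELDS = ["name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
          "cdsEnd", "exonCount", "exonStarts", "exonEnds",
          "proteinID", "alignID"]

def scan_known_gene(lines):
    """Return (noncoding_transcripts, ambiguous_protein_ids).

    noncoding_transcripts: transcript names whose cdsStart equals cdsEnd.
    ambiguous_protein_ids: protein ID -> sorted chromosomes, for IDs that
    occur on more than one chromosome.
    """
    noncoding = []
    chroms_by_protein = defaultdict(set)
    for row in csv.DictReader(lines, fieldnames=FIELDS, delimiter="\t"):
        if row["cdsStart"] == row["cdsEnd"]:
            noncoding.append(row["name"])
        if row["proteinID"]:
            chroms_by_protein[row["proteinID"]].add(row["chrom"])
    ambiguous = {pid: sorted(chroms)
                 for pid, chroms in chroms_by_protein.items()
                 if len(chroms) > 1}
    return noncoding, ambiguous
```

To run it on the real file, something like `with open("knownGene.txt") as fh: noncoding, ambiguous = scan_known_gene(fh)` should be enough.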

I must admit I'm quite surprised that this information isn't more easily accessible for the human genome. It seems like resolving gene-mRNA relationships should be an elementary task. Am I looking in the wrong place and/or using the wrong tools to look for this information?

human annotation • 3.5k views

Part of the solution to this problem is to realize that identifiers such as Q15005 are UniProt accession numbers and therefore not suitable as a "gene_id". Ensembl/BioMart is a good solution as outlined in Michael's answer.

11.4 years ago
Michael 55k

How about Ensembl BioMart? Something like this query, maybe?

Btw, it seems the exported URL contains a session ID, so I can see what you are selecting ;) I think this is a bug. Just remove the session ID between martview/ and ?. The following link is the real starting point and should also work after the session expires: http://www.ensembl.org/biomart/martview?VIRTUALSCHEMANAME=default&ATTRIBUTES=....
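If you would rather script the query than click through martview, BioMart also accepts an XML query. Here is a rough Python sketch that only builds the XML and a request URL; it does not contact the server. The dataset name, the attribute names, and the martservice endpoint are my assumptions about the Ensembl mart, and the helper names are made up, so check them against the BioMart attribute list before relying on this.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def build_biomart_query(attributes, dataset="hsapiens_gene_ensembl"):
    """Build a BioMart XML query string requesting the given attributes.

    The dataset name and the Query settings are assumptions; adjust as needed.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")

def build_biomart_url(xml_query,
                      base="http://www.ensembl.org/biomart/martservice"):
    """URL-encode the XML query as the 'query' parameter (assumed endpoint)."""
    return base + "?" + urlencode({"query": xml_query})

# Assumed attribute names for gene/transcript IDs and biotypes.
example_xml = build_biomart_query([
    "ensembl_gene_id", "ensembl_transcript_id",
    "gene_biotype", "transcript_biotype",
])
example_url = build_biomart_url(example_xml)
```

The resulting URL can then be fetched with whatever HTTP client you prefer.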


+1. Adding gene and transcript biotype helps to discriminate between protein-coding and non-protein-coding genes/transcripts. I can also add exon ID, but it doesn't look like I can select exon coordinates, nor is any CDS information available. So although this query seems to solve the main problem I've been having (gene-mRNA relationships), it doesn't provide some of the other basic information.


If you choose "Sequences" instead of "Structures", you can select various kinds of sequences, including "Coding sequence", as FASTA files.


But this provides the actual sequences, not the annotation/coordinates of those sequences with respect to the genomic sequence.


That being said, if I look for "Structures" attributes instead of "Features" attributes, available exon information includes exon and CDS coordinates. So I would have to download and process two separate files, but it seems like everything I need is there. Thanks!
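Processing the two downloads together could look roughly like the sketch below: one "Features" export supplies the gene/transcript IDs and biotype, and one "Structures" export supplies exon coordinates. The column headers used here are assumptions about what the exports contain, and the function names are hypothetical; rename them to match your files. Both files are assumed to be tab-separated with a header line.

```python
import csv
from collections import defaultdict

def load_coding_transcripts(feature_lines):
    """Map transcript ID -> gene ID, keeping protein-coding transcripts only."""
    tx_to_gene = {}
    for row in csv.DictReader(feature_lines, delimiter="\t"):
        if row["Transcript Biotype"] == "protein_coding":
            tx_to_gene[row["Ensembl Transcript ID"]] = row["Ensembl Gene ID"]
    return tx_to_gene

def attach_exons(structure_lines, tx_to_gene):
    """Build gene -> transcript -> sorted list of (exon_start, exon_end)."""
    genes = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(structure_lines, delimiter="\t"):
        tx = row["Ensembl Transcript ID"]
        gene = tx_to_gene.get(tx)
        if gene is None:
            continue  # transcript was filtered out as non-protein-coding
        genes[gene][tx].append((int(row["Exon Chr Start (bp)"]),
                                int(row["Exon Chr End (bp)"])))
    for transcripts in genes.values():
        for exons in transcripts.values():
            exons.sort()
    return genes
```

CDS coordinates could be carried along the same way by reading the corresponding columns from the "Structures" file.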

11.4 years ago
Devonj ▴ 90

At our company, we were encountering many of the same issues, so we recently switched from UCSC data to the NCBI data found here: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF/ref_GRCh37.p10_top_level.gff3.gz

Or there is a more recently updated file here: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/GFF_interim/interim_GRCh37.p12_top_level.gff3_2013-04-10.gz

I believe this is the same data that is used to create the embedded chromosome maps you see at dbSNP and NCBI Gene (e.g. http://www.ncbi.nlm.nih.gov/gene/1080 )

It's more than just protein-coding genes, and requires a bit of parsing to get it into a format like UCSC, but for the most part quite good. (I could expound on a few issues we found if needed)
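As an illustration of the parsing involved, here is a minimal Python sketch that groups mRNAs under their parent gene using the standard GFF3 ID/Parent attributes, and keeps only mRNAs that have at least one CDS child. Treating "mRNA with a CDS" as the protein-coding test is my assumption, not something specific to the NCBI files, and the function names are made up.

```python
from collections import defaultdict

def parse_attributes(field):
    """Parse a GFF3 column-9 string such as 'ID=rna0;Parent=gene0'."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs

def coding_mrnas_by_gene(lines):
    """Return gene ID -> list of mRNA IDs that have at least one CDS child."""
    mrna_parent = {}
    has_cds = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        ftype, attrs = cols[2], parse_attributes(cols[8])
        if ftype == "mRNA" and "ID" in attrs and "Parent" in attrs:
            mrna_parent[attrs["ID"]] = attrs["Parent"]
        elif ftype == "CDS":
            # A GFF3 Parent attribute may list several IDs, comma-separated.
            has_cds.update(attrs.get("Parent", "").split(","))
    genes = defaultdict(list)
    for mrna, gene in mrna_parent.items():
        if mrna in has_cds:
            genes[gene].append(mrna)
    return dict(genes)
```

For the gzipped download, opening the file with `gzip.open(path, "rt")` and passing the handle in should work.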


+1. I was intending to generate GFF3 files anyway, so that's a bonus. Discriminating between protein-coding genes and other genes looks straightforward. There are 4 or 5 mRNAs with issues (overlapping or adjacent exons), but these are easy to ignore.

11.4 years ago

Don't use the UCSC GTF file; download the one from Ensembl instead, as it doesn't have the non-unique gene_id issue. I understand why the UCSC annotation is the way it is, but that makes things annoying in situations like yours (or when one tries to use DEXSeq on RNA-seq reads, for the same reasons).
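Whichever GTF you end up with, it is easy to check the gene_id behaviour yourself. A small Python sketch, assuming the usual GTF attribute syntax (`gene_id "..."; transcript_id "...";`) in column 9: it collects the transcripts per gene_id and the set of (chromosome, strand) pairs each gene_id appears on, so any gene_id with more than one pair stands out. The function name is made up.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def transcripts_by_gene(lines):
    """Return (gene_id -> set of transcript_ids, gene_id -> set of (chrom, strand))."""
    genes = defaultdict(set)
    loci = defaultdict(set)
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
            loci[attrs["gene_id"]].add((cols[0], cols[6]))
    return genes, loci
```

Something like `suspicious = [g for g, where in loci.items() if len(where) > 1]` then lists the gene_ids worth a closer look.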


I would suggest that this would be much more appropriate as a comment on Michael Dondrup's answer rather than an answer on its own.
