I'm interested in analyzing gene and mRNA annotations for the human genome. In addition to the exon/intron structure and CDS for each mRNA, I would like to know which mRNAs correspond to the same gene.
I first tried using genePredToGtf from Jim Kent's tools to download the annotations in GTF format, and then tried working directly with the knownGene table data (knownGene.txt.gz). However, I've encountered a few issues that make it difficult to resolve gene-mRNA relationships in these data.
- It seems the gene_id attribute is not unique; that is, some distinct gene loci share the same gene_id value. Consider Q15005: if you run
  grep Q15005 knownGene.txt
  you'll find a gene on chr1 and a different one on chr11 that both use this gene_id. A little digging revealed that the one on chr11 is a legitimate protein-coding gene and the one on chr1 is a pseudogene, but the knownGene data do not include this information.
- I'm only interested in protein-coding genes. Many entries in knownGene have CDS start = CDS end, and in the GTF files these simply have no CDS features. I assume I can safely ignore all of these as pseudogenes or other non-protein-coding genes. However, as the previous example showed, some pseudogenes do have a CDS listed, and there seems to be no other way to distinguish genes that code for proteins from genes that do not.
- Some transcripts that appear to belong to the same gene/locus have distinct gene_id values. Most (if not all) of these seem to belong to non-protein-coding genes, however. Since I'm only interested in protein-coding genes, this may not be a problem.
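The first two checks above could be scripted roughly as follows. This is only a minimal sketch: it assumes knownGene.txt is tab-separated with the twelve columns listed in KNOWN_GENE_COLUMNS (my reading of the UCSC table layout, not something stated above), and that identifiers such as Q15005 sit in the proteinID column. The function names are made up for illustration.

```python
from collections import defaultdict

# Assumed column order of the UCSC knownGene table (an assumption, verify
# against the table schema for your assembly before relying on it).
KNOWN_GENE_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "proteinID", "alignID",
]


def parse_known_gene(lines):
    """Yield one dict per non-empty, tab-separated knownGene row."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        yield dict(zip(KNOWN_GENE_COLUMNS, line.split("\t")))


def has_cds(record):
    """True when the row annotates a CDS, i.e. cdsStart differs from cdsEnd."""
    return int(record["cdsStart"]) != int(record["cdsEnd"])


def ids_on_multiple_chromosomes(records, key="proteinID"):
    """Map each non-empty ID to its chromosomes, keeping IDs seen on more than one."""
    chroms_by_id = defaultdict(set)
    for rec in records:
        if rec[key]:
            chroms_by_id[rec[key]].add(rec["chrom"])
    return {k: sorted(v) for k, v in chroms_by_id.items() if len(v) > 1}
```

Reading the file with open("knownGene.txt"), keeping only the rows for which has_cds is true, and passing them to ids_on_multiple_chromosomes should flag identifiers like Q15005. Note that it only catches IDs shared across chromosomes; distinct loci on the same chromosome would need a coordinate-based check, and it cannot tell a pseudogene with a listed CDS from a real protein-coding gene.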
I must admit I'm quite surprised that this information isn't more easily accessible for the human genome. It seems like resolving gene-mRNA relationships should be an elementary task. Am I looking in the wrong place and/or using the wrong tools to look for this information?
Part of the solution to this problem is to realize that identifiers such as Q15005 are UniProt accession numbers and therefore not suitable as a "gene_id". Ensembl/BioMart is a good solution as outlined in Michael's answer.
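As a rough illustration of the BioMart route, the sketch below groups transcript IDs by gene ID from a tab-separated BioMart export, keeping only protein-coding genes. The column headers used as defaults ("Gene stable ID", "Transcript stable ID", "Gene type") are assumptions and differ between Ensembl releases, so pass whatever headers your export actually contains; the function name is hypothetical.

```python
import csv
from collections import defaultdict


def transcripts_by_gene(tsv_lines,
                        gene_col="Gene stable ID",              # assumed header
                        transcript_col="Transcript stable ID",  # assumed header
                        biotype_col="Gene type",                # assumed header
                        biotype="protein_coding"):
    """Return {gene_id: [transcript_id, ...]} for rows matching the biotype.

    tsv_lines is any iterable of lines (e.g. an open file) whose first line
    is the header row. Set biotype_col to None to skip the biotype filter.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    genes = defaultdict(list)
    for row in reader:
        if biotype_col is not None and row.get(biotype_col) != biotype:
            continue
        genes[row[gene_col]].append(row[transcript_col])
    return dict(genes)
```

Because each transcript row carries its own gene ID and gene biotype, the gene-mRNA relationship and the protein-coding filter both come straight from the export rather than being inferred from CDS coordinates.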