I access Ensembl data via the Perl API to retrieve information on genes, transcripts, etc. I noticed that when I pull data from the gene table, some genes occur twice: once on the X chromosome and once on the Y chromosome. This affects 45 human genes; for 34 of the 45, the start and end positions on X and Y are identical.
Two examples:
geneID | biotype | chromosome | start | end |
---|---|---|---|---|
ENSG00000002586 | protein_coding | X | 2691179 | 2741309 |
ENSG00000002586 | protein_coding | Y | 2691179 | 2741309 |
ENSG00000124333 | protein_coding | X | 155881293 | 155943769 |
ENSG00000124333 | protein_coding | Y | 57067813 | 57130289 |
When I looked up some of these genes on the Ensembl website, it turned out that they map to pseudoautosomal regions (identical sequence on X and Y).
Some more information on how I retrieve the data:
To speed things up I iterate over chromosomes in parallel and retrieve all genes as follows:
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr_name);
my @genes = @{ $slice->get_all_Genes() };
So ENSG00000002586 appears in @genes both when I query X and when I query Y. If, however, I fetch a gene directly by its stable ID, I only get the X chromosome:
my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000124333');
print $gene->seq_region_name(); # => X
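To make the duplication concrete, here is a minimal sketch of how the per-chromosome results can be grouped by stable ID to list every gene reported on more than one chromosome. It does not call the Ensembl API; the rows are hard-coded from the table above, and the names `@rows` and `%locations` are only illustrative.

```perl
use strict;
use warnings;

# Rows as returned by the per-chromosome queries (values copied from the table above).
my @rows = (
    [ 'ENSG00000002586', 'X', 2691179,   2741309   ],
    [ 'ENSG00000002586', 'Y', 2691179,   2741309   ],
    [ 'ENSG00000124333', 'X', 155881293, 155943769 ],
    [ 'ENSG00000124333', 'Y', 57067813,  57130289  ],
);

# Group all reported locations by stable ID.
my %locations;
for my $row (@rows) {
    my ( $id, $chr, $start, $end ) = @$row;
    push @{ $locations{$id} }, "$chr:$start-$end";
}

# Print every stable ID that was reported at more than one location.
for my $id ( sort keys %locations ) {
    my @loc = @{ $locations{$id} };
    next if @loc < 2;
    print "$id\t", join( ', ', @loc ), "\n";
}
```

With real data, the rows would be filled from `$gene->stable_id()`, `$gene->seq_region_name()`, `$gene->seq_region_start()` and `$gene->seq_region_end()` for each gene in `@genes`.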
A post on the Ensembl dev mailing list (http://lists.ensembl.org/pipermail/dev/2010-October/000214.html) says that a gene may extend beyond a pseudoautosomal region into a region unique to the Y chromosome, which could explain why a gene shows up for both X and Y. However, I checked this, and none of the gene coordinates overlap a Y-unique region.
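The kind of check I mean can be sketched as follows. The PAR boundaries on Y are an assumption: they are the GRCh38 values as I understand them and should be verified against the assembly in use. `within_par` is a hypothetical helper, not part of the Ensembl API.

```perl
use strict;
use warnings;

# Assumed GRCh38 pseudoautosomal regions on Y (PAR1, PAR2); verify before relying on them.
my @par_y = ( [ 10_001, 2_781_479 ], [ 56_887_903, 57_217_415 ] );

# Hypothetical helper: returns 1 if [start, end] lies entirely inside one PAR, else 0.
sub within_par {
    my ( $start, $end ) = @_;
    for my $par (@par_y) {
        return 1 if $start >= $par->[0] && $end <= $par->[1];
    }
    return 0;
}

# Y coordinates of the two example genes from the table above.
my @genes_on_y = (
    [ 'ENSG00000002586', 2691179,  2741309  ],
    [ 'ENSG00000124333', 57067813, 57130289 ],
);

for my $gene (@genes_on_y) {
    my ( $id, $start, $end ) = @$gene;
    printf "%s\t%s\n", $id,
        within_par( $start, $end ) ? 'inside PAR' : 'extends outside PAR';
}
```

A gene that straddled a PAR boundary would be reported as extending outside the PAR, which is the case the mailing list post describes.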
Questions
- How come the positions are identical for some of the genes?
- Has anyone observed this as well and figured out why one gets these duplicate gene entries?