Question

Ensembl API - pseudoautosomal regions (PAR)


8.7 years ago

JS • 0

I access Ensembl data via their Perl API and retrieve information on genes, transcripts, etc. I have noticed that when I pull data from the database's gene table, some genes occur twice: once on the X and once on the Y chromosome. This affects 45 human genes; for 34 of the 45, the start and end positions on X and Y are identical.

Two examples:

geneID	biotype	chromosome	start	end
ENSG00000002586	protein_coding	X	2691179	2741309
ENSG00000002586	protein_coding	Y	2691179	2741309
ENSG00000124333	protein_coding	X	155881293	155943769
ENSG00000124333	protein_coding	Y	57067813	57130289

When querying some of these genes via the Ensembl website it turned out that they are mapped to pseudoautosomal regions (identical sequence on X and Y).

Some more information on how I retrieve the data:

To speed things up I iterate over chromosomes in parallel and retrieve all genes as follows:

# Fetch the whole chromosome as a slice, then all genes overlapping it
my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr_name);
my @genes = @{ $slice->get_all_Genes() };

So ENSG00000002586 ends up in @genes both when I query X and when I query Y. If, however, I fetch the gene directly by its stable ID, I only get the X chromosome:

my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000124333');

print $gene->seq_region_name(); # => X

According to http://lists.ensembl.org/pipermail/dev/2010-October/000214.html, a gene might extend beyond a pseudoautosomal region into a region unique to the Y chromosome, which could explain why it shows up for both X and Y. However, I checked this, and none of the gene coordinates overlap the unique regions of Y.

Questions

How come the positions are identical for some of the genes?
Has anyone observed this as well and figured out why one gets these duplicate gene entries?

gene genome • 2.3k views

updated 8.7 years ago by Ensembl Helen ▴ 60 • written 8.7 years ago by JS • 0

score 2 · Accepted Answer · 2016-12-07

As you have found, the Y chromosome is partially identical to the X, and these identical stretches are designated pseudoautosomal regions (PARs). In PARs, the genes are annotated only on X because, in Ensembl, we do not duplicate them in the identical region of Y (internally there is no Y in that coordinate range). The original and unique annotation is on the X chromosome, but it also appears on Y when you look at it from Y.
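Since each PAR gene is a single annotation that shows up under both chromosomes, a per-chromosome loop can be collapsed to one record per stable ID. The following is only a minimal sketch in plain Perl, not part of the Ensembl API: the record layout and the helper name `dedupe_by_stable_id` are hypothetical, and the data are the two examples from the question.

```perl
use strict;
use warnings;

# Hypothetical records as they might be collected from the per-chromosome
# slices; the values are the two examples from the question.
my @records = (
    { id => 'ENSG00000002586', chr => 'X', start => 2691179,   end => 2741309 },
    { id => 'ENSG00000002586', chr => 'Y', start => 2691179,   end => 2741309 },
    { id => 'ENSG00000124333', chr => 'X', start => 155881293, end => 155943769 },
    { id => 'ENSG00000124333', chr => 'Y', start => 57067813,  end => 57130289 },
);

# Keep one record per stable ID, preferring the X copy because that is
# where the original annotation lives. (Hypothetical helper.)
sub dedupe_by_stable_id {
    my @recs = @_;
    my (%kept, @order);
    for my $rec (@recs) {
        my $id = $rec->{id};
        if (!exists $kept{$id}) {
            push @order, $id;          # remember first-seen order
            $kept{$id} = $rec;
        }
        elsif ($rec->{chr} eq 'X' && $kept{$id}{chr} ne 'X') {
            $kept{$id} = $rec;         # replace a Y copy with the X copy
        }
    }
    return map { $kept{$_} } @order;
}

my @unique = dedupe_by_stable_id(@records);
printf "%s\t%s\t%d\t%d\n", @{$_}{qw(id chr start end)} for @unique;
```

Genes that exist only on Y are untouched by this, because their stable IDs never collide with an X record.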

The PARs are mapped between X and Y as: Y:10001-2781479 to X:10001-2781479 and Y:56887903-57217415 to X:155701383-156030895

So any gene from X in the first block will have the same coordinates on Y, but as you are seeing in your second example (ENSG00000124333), this isn’t true for the second block, where the Y coordinates are shifted by a constant offset.
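The block mapping can be checked with a little arithmetic. This is a minimal sketch in plain Perl, not an Ensembl API call: the helper name `x_to_y` is hypothetical, and the block boundaries are copied from the coordinates quoted above, so they only hold for the assembly those coordinates refer to.

```perl
use strict;
use warnings;

# PAR blocks as quoted above: [X start, X end, Y start, Y end].
my @par_blocks = (
    [ 10001,     2781479,   10001,    2781479  ],
    [ 155701383, 156030895, 56887903, 57217415 ],
);

# Translate an X position into the matching Y position.
# Returns undef (in scalar context) if the position is outside both PARs.
# (Hypothetical helper.)
sub x_to_y {
    my ($x_pos) = @_;
    for my $block (@par_blocks) {
        my ($x_start, $x_end, $y_start) = @$block;
        return $x_pos - $x_start + $y_start
            if $x_pos >= $x_start && $x_pos <= $x_end;
    }
    return;
}

# Second example from the question: ENSG00000124333 on X.
my $y_start = x_to_y(155881293);
my $y_end   = x_to_y(155943769);
print "Y:$y_start-$y_end\n";    # prints "Y:57067813-57130289"
```

The result matches the Y coordinates reported for ENSG00000124333 in the question, while positions in the first block come back unchanged.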

We import the assembly from the GRC, and this is the same as their representation: https://www.ncbi.nlm.nih.gov/grc/human

Hopefully this helps.