We have a 2.2 GB Entrez Gene XML file for the bovine genome (gene_result.xml), and we are trying to map positions between two assemblies.
In some cases there are two different positions for the same assembly (see the two Seq-interval_from values below). The Gene-commentary_version number is the same, and the NCBI DTD/XSD schemas don't comment on this case. We want a one-to-one mapping of interval positions between the UMD 3.1 assembly and the reference assembly Btau_4.2. How would you "disambiguate"? Do you think we should assume the last one is the updated one?
You can obtain this XML from the binary ASN.1 file on the NCBI FTP site, at gene/DATA/ASNBINARY/Mammalia/Bos_taurus.ags.gz
<Entrezgene>
...
<Entrezgene_gene>
<Gene-ref>
<Gene-ref_locus>ATP6V1A</Gene-ref_locus>
<Gene-ref_desc>ATPase, H+ transporting, lysosomal 70kDa, V1 subunit A</Gene-ref_desc>
<Gene-ref_maploc>1</Gene-ref_maploc>
...
<Entrezgene_locus>
<Gene-commentary>
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
<Gene-commentary_heading>Reference assembly (based on Btau_4.2)</Gene-commentary_heading>
<Gene-commentary_label>chromosome</Gene-commentary_label>
<Gene-commentary_accession>NC_007299</Gene-commentary_accession>
<Gene-commentary_version>4</Gene-commentary_version>
<Gene-commentary_seqs>
<Seq-loc>
<Seq-loc_int>
<Seq-interval>
<Seq-interval_from>59288053</Seq-interval_from>
<Seq-interval_to>59336577</Seq-interval_to>
<Seq-interval_strand>
<Na-strand value="plus"/>
...
<Gene-commentary>
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
<Gene-commentary_heading>Reference assembly (based on Btau_4.2)</Gene-commentary_heading>
<Gene-commentary_label>chromosome</Gene-commentary_label>
<Gene-commentary_accession>NC_007299</Gene-commentary_accession>
<Gene-commentary_version>4</Gene-commentary_version>
<Gene-commentary_seqs>
<Seq-loc>
<Seq-loc_int>
<Seq-interval>
<Seq-interval_from>59346176</Seq-interval_from>
<Seq-interval_to>59352092</Seq-interval_to>
<Seq-interval_strand>
<Na-strand value="plus"/>
</Seq-interval_strand>
Thanks for the Liftover link. However, the first part of your reply confused me even more, in a good way. If a gene can have more than one position, what is a gene then?
IMHO, the best definition of gene is "any functional genomic unit", though this is prone to vary between biologists. A less ambiguous term for the data you are trying to map would be "locus" since what you have is a position -> transcript mapping. I have no pretense of being a reference on terminology, though.
I would read the data you got from Entrez as saying: "We have figured out, either from evidence or prediction, that Bos taurus has a functional ATP6V1A gene, and based on currently available genomic assemblies, it could come from either 'here' or 'there'".
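If you read it that way, one option is to keep every placement rather than assume the last one wins, and then flag the genes that have more than one interval on the same assembly. Here is a minimal Python sketch of that idea. It is not an NCBI-endorsed method. The element names come from the excerpt in the question, but the sample is a hand-completed, well-formed version of that excerpt (the closing tags and the nesting directly under Entrezgene_locus are assumptions), and `collect_intervals` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-completed version of the excerpt in the question (closing tags assumed).
SAMPLE = """<Entrezgene>
  <Entrezgene_locus>
    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference assembly (based on Btau_4.2)</Gene-commentary_heading>
      <Gene-commentary_accession>NC_007299</Gene-commentary_accession>
      <Gene-commentary_version>4</Gene-commentary_version>
      <Gene-commentary_seqs><Seq-loc><Seq-loc_int><Seq-interval>
        <Seq-interval_from>59288053</Seq-interval_from>
        <Seq-interval_to>59336577</Seq-interval_to>
      </Seq-interval></Seq-loc_int></Seq-loc></Gene-commentary_seqs>
    </Gene-commentary>
    <Gene-commentary>
      <Gene-commentary_type value="genomic">1</Gene-commentary_type>
      <Gene-commentary_heading>Reference assembly (based on Btau_4.2)</Gene-commentary_heading>
      <Gene-commentary_accession>NC_007299</Gene-commentary_accession>
      <Gene-commentary_version>4</Gene-commentary_version>
      <Gene-commentary_seqs><Seq-loc><Seq-loc_int><Seq-interval>
        <Seq-interval_from>59346176</Seq-interval_from>
        <Seq-interval_to>59352092</Seq-interval_to>
      </Seq-interval></Seq-loc_int></Seq-loc></Gene-commentary_seqs>
    </Gene-commentary>
  </Entrezgene_locus>
</Entrezgene>"""


def collect_intervals(entrezgene):
    """Group genomic intervals by (assembly heading, accession) for one gene."""
    placements = defaultdict(list)
    locus = entrezgene.find("Entrezgene_locus")
    if locus is None:
        return {}
    for commentary in locus.findall("Gene-commentary"):
        type_el = commentary.find("Gene-commentary_type")
        if type_el is None or type_el.get("value") != "genomic":
            continue
        key = (
            commentary.findtext("Gene-commentary_heading"),
            commentary.findtext("Gene-commentary_accession"),
        )
        # Only look under this commentary's own seqs element, so intervals of
        # nested commentaries (e.g. transcripts) are not mixed in.
        seqs = commentary.find("Gene-commentary_seqs")
        if seqs is None:
            continue
        for interval in seqs.iter("Seq-interval"):
            start = int(interval.findtext("Seq-interval_from"))
            end = int(interval.findtext("Seq-interval_to"))
            placements[key].append((start, end))
    return dict(placements)


placements = collect_intervals(ET.fromstring(SAMPLE))
for (heading, accession), intervals in placements.items():
    flag = "AMBIGUOUS" if len(intervals) > 1 else "unique"
    print(heading, accession, flag, intervals)
```

For ATP6V1A this reports both Btau_4.2 intervals under the same key, so the gene can be set aside for manual review or handled by whatever rule you decide on, instead of one placement being dropped silently. For the full 2.2 GB file you would probably stream it, for example with `ET.iterparse`, handling each `Entrezgene` element as it completes and clearing it afterwards, rather than loading everything at once.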