Mapping Multiple Interval Positions In Bovine Genome
13.2 years ago

We have a 2.2 GB Entrez Gene XML file for the bovine genome (gene_result.xml), and we are trying to map positions between two assemblies.

In some cases there are two different positions for the same assembly (see Seq-interval_from below). The Gene-commentary_version number is the same, and the NCBI DTD/XSD schemas don't have any particular comment on this case. We want a one-to-one mapping of interval positions between the UMD 3.1 assembly and the reference assembly Btau_4.2. How would you "disambiguate"? Do you think we should assume the last one is the updated one?

You can obtain this XML from the binary ASN.1 file on the NCBI FTP site at gene/DATA/ASN_BINARY/Mammalia/Bos_taurus.ags.gz

<Entrezgene>
...
    <Entrezgene_gene>
        <Gene-ref>
            <Gene-ref_locus>ATP6V1A</Gene-ref_locus>
            <Gene-ref_desc>ATPase, H+ transporting, lysosomal 70kDa, V1 subunit A</Gene-ref_desc>
            <Gene-ref_maploc>1</Gene-ref_maploc>
...
    <Entrezgene_locus>
        <Gene-commentary>
            <Gene-commentary_type value="genomic">1</Gene-commentary_type>
            <Gene-commentary_heading>Reference assembly (based on Btau_4.2)</Gene-commentary_heading>
            <Gene-commentary_label>chromosome</Gene-commentary_label>
            <Gene-commentary_accession>NC_007299</Gene-commentary_accession>
            <Gene-commentary_version>4</Gene-commentary_version>
            <Gene-commentary_seqs>
                <Seq-loc>
                    <Seq-loc_int>
                        <Seq-interval>
                            <Seq-interval_from>59288053</Seq-interval_from>
                            <Seq-interval_to>59336577</Seq-interval_to>
                            <Seq-interval_strand>
                                <Na-strand value="plus"/>
...
        <Gene-commentary>
            <Gene-commentary_type value="genomic">1</Gene-commentary_type>
            <Gene-commentary_heading>Reference assembly (based on Btau_4.2)</Gene-commentary_heading>
            <Gene-commentary_label>chromosome</Gene-commentary_label>
            <Gene-commentary_accession>NC_007299</Gene-commentary_accession>
            <Gene-commentary_version>4</Gene-commentary_version>
            <Gene-commentary_seqs>
                <Seq-loc>
                    <Seq-loc_int>
                        <Seq-interval>
                            <Seq-interval_from>59346176</Seq-interval_from>
                            <Seq-interval_to>59352092</Seq-interval_to>
                            <Seq-interval_strand>
                                <Na-strand value="plus"/>
                            </Seq-interval_strand>
assembly ncbi gene • 2.5k views
13.2 years ago
Eric Fournier ★ 1.4k

Genes might have more than one entry/position either because (1) their position is ambiguous given the current data or (2) a recent duplication event has yielded two functional loci. In either case, neither entry is "wrong", and there is no way to disambiguate between the two. If you really must have a one-to-one mapping, then you'll have to choose at random.

As an aside, what are you trying to do with your mapping? There are already tools that map genomic coordinates between assemblies (for example, LiftOver), and their authors have probably had to consider this same problem.


Thanks for the LiftOver link. However, the first part of your reply confused me even more, in a good way. If a gene might have more than one position, what is a gene then?


IMHO, the best definition of a gene is "any functional genomic unit", though this is prone to vary between biologists. A less ambiguous term for the data you are trying to map would be "locus", since what you have is a position -> transcript mapping. I have no pretense of being a reference on terminology, though.

I would read the data you got from Entrez as meaning: "We have figured out, either from evidence or prediction, that Bos taurus has a functional ATP6V1A gene, and based on currently available genomic assemblies, it could come from either 'here' or 'there'".
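
To see how often a gene really has more than one placement on the same assembly, one option is to stream the XML and group intervals by heading and accession. The following is only a minimal sketch using Python's standard library: the element paths are taken from the excerpt in the question, while the function names `collect_placements` and `multi_placed` and the inline `SAMPLE` record are made up for illustration. It assumes each `Entrezgene` record lists its genomic placements as direct `Gene-commentary` children of `Entrezgene_locus`, as the excerpt suggests.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

INTERVAL_PATH = "Gene-commentary_seqs/Seq-loc/Seq-loc_int/Seq-interval"

def collect_placements(source):
    """Yield (locus, placements) for each Entrezgene record in `source`.

    `placements` maps (heading, accession, version) to a list of
    (from, to, strand) tuples, so repeated placements stay visible.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Entrezgene":
            continue
        locus = elem.findtext("Entrezgene_gene/Gene-ref/Gene-ref_locus")
        placements = defaultdict(list)
        for gc in elem.findall("Entrezgene_locus/Gene-commentary"):
            key = (gc.findtext("Gene-commentary_heading"),
                   gc.findtext("Gene-commentary_accession"),
                   gc.findtext("Gene-commentary_version"))
            for iv in gc.findall(INTERVAL_PATH):
                strand = iv.find("Seq-interval_strand/Na-strand")
                placements[key].append((
                    int(iv.findtext("Seq-interval_from")),
                    int(iv.findtext("Seq-interval_to")),
                    strand.get("value") if strand is not None else None,
                ))
        yield locus, dict(placements)
        elem.clear()  # drop the finished record's children to limit memory use

def multi_placed(placements):
    """Return only the keys that carry more than one interval."""
    return {k: v for k, v in placements.items() if len(v) > 1}

def _commentary(start, end):
    # Illustrative helper: builds one Gene-commentary like those in the question.
    return (
        "<Gene-commentary>"
        "<Gene-commentary_heading>Reference assembly (based on Btau_4.2)"
        "</Gene-commentary_heading>"
        "<Gene-commentary_accession>NC_007299</Gene-commentary_accession>"
        "<Gene-commentary_version>4</Gene-commentary_version>"
        "<Gene-commentary_seqs><Seq-loc><Seq-loc_int><Seq-interval>"
        f"<Seq-interval_from>{start}</Seq-interval_from>"
        f"<Seq-interval_to>{end}</Seq-interval_to>"
        '<Seq-interval_strand><Na-strand value="plus"/></Seq-interval_strand>'
        "</Seq-interval></Seq-loc_int></Seq-loc></Gene-commentary_seqs>"
        "</Gene-commentary>"
    )

# Tiny stand-in for gene_result.xml, built from the values quoted in the question.
SAMPLE = (
    "<Entrezgene-Set><Entrezgene>"
    "<Entrezgene_gene><Gene-ref><Gene-ref_locus>ATP6V1A</Gene-ref_locus>"
    "</Gene-ref></Entrezgene_gene>"
    "<Entrezgene_locus>"
    + _commentary(59288053, 59336577)
    + _commentary(59346176, 59352092)
    + "</Entrezgene_locus></Entrezgene></Entrezgene-Set>"
).encode("utf-8")

if __name__ == "__main__":
    for locus, placements in collect_placements(io.BytesIO(SAMPLE)):
        for key, intervals in multi_placed(placements).items():
            print(locus, key, intervals)
```

For the real file, pass a path or an open binary file handle instead of the `io.BytesIO` sample. Listing the multi-placed loci first makes it easier to decide whether to keep all placements, drop those genes, or pick one by some explicit rule, rather than silently taking the last entry.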
