List Of Dbsnp Variants Where Reference Genome Has Minor Allele
3
2
Entering edit mode
13.0 years ago

While working on a couple of exome projects, I've ran into situations where for some of the variants we are calling, the reference genome allele is annotated in dbSNP as the minor allele (MAF<5%), and as such the variants are not so interesting to us.

Here is an example:rs7185 < http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7185 > ref=C -> alt=T MAF/MinorAlleleCount: C=0.142/311

I would like to filter those out, but cannot find a simple file with that information... I am using the 00-All.vcf.gz from dbSNP to identify known variants, which lists whether some of the alleles are low frequency, but not which one is which...

I would like to be able to flag these variants - does someone know where I can get that data from, ideally somewhere that gets updated with dbSNP versions? Thank you so much!

reference dbsnp variant exome • 5.3k views
ADD COMMENT
3
Entering edit mode
13.0 years ago

Generate some java classes using ${JAVA_HOME}/bin/xjc and compile the java program below:

 xjc "ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_3.3.xsd" 
 javac Biostar15272.java gov/nih/nlm/ncbi/snp/docsum/*.java

Source:

import gov.nih.nlm.ncbi.snp.docsum.*;

import java.util.ArrayList;
import java.io.InputStream;
import java.util.List;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

/**
 xjc "ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_3.3.xsd" 
 javac Biostar15272.java gov/nih/nlm/ncbi/snp/docsum/*.java
*/
public class Biostar15272
    {

    private void run(InputStream in) throws Exception
        {

        XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
        xmlInputFactory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        xmlInputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
        JAXBContext jaxbCtxt=JAXBContext.newInstance(Rs.class);
        Unmarshaller unmarshaller=jaxbCtxt.createUnmarshaller();

            XMLEventReader reader= xmlInputFactory.createXMLEventReader(in); 
            boolean headerPrinted=false;
            while(reader.hasNext())
                {
                XMLEvent evt=reader.peek();
                if(!evt.isStartElement())
                    {
                    reader.nextEvent();
                    continue;
                    }
                StartElement start=evt.asStartElement();
                String localName=start.getName().getLocalPart();

                if(!localName.equals("Rs"))
                    {
                    reader.nextEvent();
                    continue;
                    }

               Rs rs=unmarshaller.unmarshal(reader, Rs.class).getValue();
               for(Rs.Frequency f:rs.getFrequency())
                    {
                    System.out.println("rs"+rs.getRsId()+
                        "\t"+f.getFreq()+
                        "\t"+f.getAllele()
                        );
                    }
               }
            reader.close();
        }
    public static void main(String[] args)
        {
        try
            {
            new Biostar15272().runSystem.in);
            }
        catch (Exception e)
            {
            e.printStackTrace();
            }
        }
    }

and then invoke this parser for dbsnp to print the frequencies:

curl -s  "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz" |\
gunzip -c | java -cp . Biostar15272

rs783   0.388   G
rs805   0.3067  C
rs820   0.1901  G
rs1056  0.2984  G
rs1118  0.3528  C
rs1119  0.4817  G
rs1222  0.1161  C
rs1312  0.4685  C
rs1314  0.1229  G
rs1315  0.1499  C
rs1317  0.1499  T
rs1320  0.0297  A
rs1321  0.3441  C
rs1608  0.1024  T
rs1654  0.1559  G
rs1656  0.1878  G
rs1682  0.1522  C
rs1803  0.2399  T
rs1858  0.0347  T
rs1859  0.0119  T
rs1860  0.3259  A
rs1866  0.308   T
rs1878  0.3729  A
rs1900  0.1463  C
rs1916  0.3286  C
rs1947  0.0718  C
rs1953  0.1622  T
rs2387  0.3048  T
rs2393  0.0178  C
rs2481  0.2893  A
rs2564  0.1545  C
rs2731  0.3857  G
rs2782  0.362   T
rs2913  0.3921  G
rs3179  0.4735  C
rs3233  0.2715  C
(...)
ADD COMMENT
0
Entering edit mode

Thank you so much Pierre. This is exactly what I need.

ADD REPLY
2
Entering edit mode
13.0 years ago
User 59 13k

You might also want to take a look at this PLoS paper from Dewey et al. where they create synthetic 'major allele' references for various populations.

ADD COMMENT
0
Entering edit mode

Key here is "for various populations" because what is a minor allele in one population could be major in another. The reference genome could represent some degree of admixture, for example, and so that "minor" allele may be major from the population contributing to the admixture.

ADD REPLY
0
Entering edit mode

Interesting paper Daniel - thanks! I don't have population information for the current datasets... for now, I am planning to filter these particular alleles only when they are also annotated as G5A (i.e. reference genome is minor allele, and minor allele is <5% in all populations).

ADD REPLY
0
Entering edit mode

Interesting paper Daniel - thanks! I don't have population information for the current datasets... for now, I am planning to tag these particular alleles and use that in combination with the G5A annotation to de-prioritize these variants (i.e. reference genome is minor allele, and minor allele is <5% in all populations, then the sample allele is not that interesting to us).

ADD REPLY
1
Entering edit mode
8.9 years ago
mike ▴ 60

@Pierre, many thanks for your post. I've actually got nothing after running

curl -s "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz" |\
gunzip -c | java -cp . Biostar15272

Do you know any reason? I've just followed your descriptions above.

Thanks for your comment.

ADD COMMENT
0
Entering edit mode

It's a problem in the new files from the NCBI, I've just left a message to the helpdesk:

Hi NCBI,

I think there is an error in the XML / XSD for dbsnp.

in "ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_3.4.xsd" the target Namespace is

targetNamespace="https://www.ncbi.nlm.nih.gov/SNP/docsum"
  

while in the XML it 'http', not 'httpS'

   curl -s  "ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz" | gunzip -c
   
   <ExchangeSet xmlns:xsi="&lt;a href=" http:="" www.w3.org="" 2001="" XMLSchema-instance"="" rel="nofollow">http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum"
  
ADD REPLY
0
Entering edit mode

Thanks a lot. Could you maybe also update when NCBI replied? It's really appreciated.

ADD REPLY
0
Entering edit mode

no, and don't expect an answer during those days :-)

ADD REPLY
0
Entering edit mode

here is an update where I used sed to convert 'http' to 'https'

ADD REPLY
0
Entering edit mode

Thanks so much. It did help!

ADD REPLY
0
Entering edit mode

This looks great!

ADD REPLY
0
Entering edit mode

furthermore, the schema have changed since 4 years.

ADD REPLY

Login before adding your answer.

Traffic: 2543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6