High proportion of non-biallelic SNPs in dbSNP 155 vs 144
2
0
Entering edit mode
2.3 years ago
Al Murphy ▴ 40

We have noted a substantial increase the proportion of non-biallelic SNPs across dbSNP 144 and 155 (increased from ~1% to 24%). This is explained in detail here and summarised in the code below.

Note that this is based on the dbSNP versions in Bioconductor (SNPlocs.Hsapiens.dbSNP) available in the devel version of Bioconductor, so we would like any confirmation that this aligns with dbSNP versions outside of Bioconductor but there doesn't seem to be any numbers available online about it? The only piece of work about this is quite old now and doesn't give specific numbers SNPs that come in threes. It would be great if anyone has any insight on this?

This is a little unnerving given (from what I could find anyway) a lack of documentation around these numbers and given common advice is to remove non-biallelic SNPs before further analysis (for example, here)

Also posted here: https://github.com/hpages/SNPlocsForge/issues/1

Code to replicate our findings:

#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt", 
                            required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)

library(BSgenome)
library(data.table)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
  pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
  pkgenv <- loadNamespace(pkgname)
  snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
  snps <- snpsById(snplocs, rsids, ifnotfound="drop")
  unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
                  mcols(snps)$RefSNP_id)[rsids])
}

dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
        match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
        match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
        match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
        match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match         N         prop
#1:          both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic  32266661 0.2382168183
#3:      both non-biallelic   2565775 0.0189424855
#4: dbSNP 144 non-biallelic     13116 0.0000968322

And just to visually represent them:

library(ggplot2)
library(cowplot)
pastel_cols <- c("#9A8822","#F5CDB4","#F8AFA8",
                 "#FDDDA0","#74A089","#85D4E3",
                 #added extra to make 7
                 '#78A2CC')
ggplot(summ, aes(x = factor(match,levels=rev(c("both biallelic",
                                           "both non-biallelic",
                                          "dbSNP 155 non-biallelic",
                                          "dbSNP 144 non-biallelic"))), y = N))+
  geom_bar(stat = "identity", fill = pastel_cols[c(2,3,2,2)])+
  geom_text(label = with(summ, paste(prettyNum(N,big.mark=",",scientific=FALSE), 
                                     paste0('\n (', round(prop*100,2), '%)'))), 
            hjust = -.3)+
  ylim(0,max(summ$N)+20000000)+
  coord_flip() + 
  theme_cowplot()+
  theme(axis.title.y=element_blank())

dbSNP 144 SNPs and how they changed in dbSNP 155

genetics GWAS dbSNP sumstats • 2.8k views
ADD COMMENT
3
Entering edit mode
2.3 years ago
LChart 4.6k

I can confirm that the marginal proportion of multi-allelic SNVs are:

141: 1.9% 151: 6.8% Current: 14.5%

Raw (marginal) counts are:

# current
zcat GCF_000001405.39.gz | grep -v ^# | sed 's/RS=.*VC=\([^;]\+\).*/\1/g' | cut -f 5,8 | awk '{ if(index($1, ",") == 0) { print("BI",$2) } else { print("MU",$2)}}' | python -c "import sys; import pprint; import collections; pprint.pprint(collections.Counter(sys.stdin))"

Counter({'BI SNV\n': 857075416,
         'MU SNV\n': 145921863,
         'BI INDEL\n': 70251507,
         'BI DEL\n': 23115794,
         'MU INDEL\n': 7862733,
         'BI INS\n': 6831820,
         'MU INS\n': 1134726,
         'BI MNV\n': 293500,
         'MU DEL\n': 60568,
         'MU MNV\n': 6664})

# v151
$ zcat 00-All.vcf.gz | grep -v ^# | sed 's/RS=.*VC=\([^;]\+\).*/\1/g' | cut -f 5,8 | awk '{ if(index($1, ",") == 0) { print("BI",$2) } else { print("MU",$2)}}' | python -c "import sys; import pprint; import collections; pprint.pprint(collections.Counter(sys.stdin))"                                                      
Counter({'BI SNV\n': 537756044,
         'BI DIV\n': 59111721,
         'MU SNV\n': 39345133,
         'MU DIV\n': 1947237,
         'BI MNV\n': 209136,
         'MU MNV\n': 18695,
         'BI 0\n': 39})

# v144
zcat 00-All-144.vcf.gz | grep -v ^# | sed 's/RS=.*VC=\([^;]\+\).*/\1/g' | cut -f 5,8 | awk '{ if(index($1, ",") == 0) { print("BI",$2) } else { print("MU",$2)}}' | python -c "import sys; import pprint; import collections; pprint.pprint(collections.Counter(sys.stdin))"
Counter({'BI SNV\n': 128537150,
         'BI DIV\n': 11417903,
         'MU SNV\n': 2594069,
         'MU DIV\n': 530554,
         'BI MNV\n': 161035,
         'MU MNV\n': 17156})
ADD COMMENT
0
Entering edit mode

Hi, thanks for this, that's quite reassuring. Do you have a link to where you downloaded the data from? I'd like to try and replicate your numbers above.

ADD REPLY
1
Entering edit mode
2.3 years ago

This is… exactly what you’d expect as more and more people get sequenced?

ADD COMMENT
0
Entering edit mode

Perhaps but given there is a 9 fold increase in the number of SNPs across 144 and 155, a 23 fold increase in non-biallelic SNPs seems high. Plus it's the fact that this isn't documented anywhere and common advice is to remove non-biallelic SNPs before further analysis that makes it worrying (updated question with this note)

ADD REPLY
0
Entering edit mode

Suppose the human population is ~80% A, ~20% G, and ~0.001% T at a SNP. It's quite likely that we'd only have confident observations of A/G as of dbSNP 144, while at least one T will have been confirmed by dbSNP 155.

But yes, non-biallelic SNPs are more challenging to analyze properly. Realistically, you can get away with keeping the non-biallelic SNPs where the third allele is very uncommon (and just treat genotypes containing the third allele as "missing"), but we definitely want better methods and software so that we don't have to resort to that hack. I am working on this.

ADD REPLY
0
Entering edit mode

Makes sense, would like to hear how your work progresses, I'm the creator of the MungeSumstats R package so would be interested in incorporating an approach to this if anything comes of it.

ADD REPLY

Login before adding your answer.

Traffic: 1527 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6