We have noted a substantial increase the proportion of non-biallelic SNPs across dbSNP 144 and 155 (increased from ~1% to 24%). This is explained in detail here and summarised in the code below.
Note that this is based on the dbSNP versions in Bioconductor (SNPlocs.Hsapiens.dbSNP) available in the devel version of Bioconductor, so we would like any confirmation that this aligns with dbSNP versions outside of Bioconductor but there doesn't seem to be any numbers available online about it? The only piece of work about this is quite old now and doesn't give specific numbers SNPs that come in threes. It would be great if anyone has any insight on this?
This is a little unnerving given (from what I could find anyway) a lack of documentation around these numbers and given common advice is to remove non-biallelic SNPs before further analysis (for example, here)
Also posted here: https://github.com/hpages/SNPlocsForge/issues/1
Code to replicate our findings:
#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt",
required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)
library(BSgenome)
library(data.table)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
pkgenv <- loadNamespace(pkgname)
snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
snps <- snpsById(snplocs, rsids, ifnotfound="drop")
unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
mcols(snps)$RefSNP_id)[rsids])
}
dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match N prop
#1: both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic 32266661 0.2382168183
#3: both non-biallelic 2565775 0.0189424855
#4: dbSNP 144 non-biallelic 13116 0.0000968322
And just to visually represent them:
library(ggplot2)
library(cowplot)
pastel_cols <- c("#9A8822","#F5CDB4","#F8AFA8",
"#FDDDA0","#74A089","#85D4E3",
#added extra to make 7
'#78A2CC')
ggplot(summ, aes(x = factor(match,levels=rev(c("both biallelic",
"both non-biallelic",
"dbSNP 155 non-biallelic",
"dbSNP 144 non-biallelic"))), y = N))+
geom_bar(stat = "identity", fill = pastel_cols[c(2,3,2,2)])+
geom_text(label = with(summ, paste(prettyNum(N,big.mark=",",scientific=FALSE),
paste0('\n (', round(prop*100,2), '%)'))),
hjust = -.3)+
ylim(0,max(summ$N)+20000000)+
coord_flip() +
theme_cowplot()+
theme(axis.title.y=element_blank())
Hi, thanks for this, that's quite reassuring. Do you have a link to where you downloaded the data from? I'd like to try and replicate your numbers above.
Current: https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/
V151: https://ftp.ncbi.nlm.nih.gov/snp/pre_build152/organisms/human_9606_b151_GRCh37p13/VCF/
V144: https://ftp.ncbi.nlm.nih.gov/snp/pre_build152/organisms/archive/human_9606_b144_GRCh38p2/VCF/