I am attempting to merge the exome and genome VCFs from gnomAD. I have successfully paired down to the gene I am interested in, subset to the INFO fields I need, and now I am trying to merge.
I would like to sum all of the "AC", "AN" and "GC" INFO fields, including all of the population-specifc fields. Currently I am using the following command:
bcftools merge -m all -i AC:sum,AN:sum,AC_AFR:sum,AC_AMR:sum,AC_ASJ:sum,AC_EAS:sum,AC_FIN:sum,AC_NFE:sum,AC_OTH:sum,AC_SAS:sum,AN_AFR:sum,AN_AMR:sum,AN_ASJ:sum,AN_EAS:sum,AN_FIN:sum,AN_NFE:sum,AN_OTH:sum,AN_SAS:sum,GC_AFR:sum,GC_AMR:sum,GC_ASJ:sum,GC_EAS:sum,GC_FIN:sum,GC_NFE:sum,GC_OTH:sum,GC_SAS:sum,AC_raw:sum,AN_raw:sum,GC_raw:sum,GC:sum in1.vcf.bgz in2.vcf.bgz -o out.vcf
I can confirm that both input VCFs have the "AC" and "AN" fields, but after the merge, "AC" and "AN" are missing. All of the population-specific fields, eg. AC_AFR, are present. "GC" is also present. Why is the merge dropping "AC" and "AN"?
I am using: bcftools 1.9 Using htslib 1.9
The complete script:
library(Rsamtools)
rd <- file.path("inst", "raw")
dd <- file.path("data")
if (!dir.exists(dirname(rd))) dir.create(dirname(rd))
if (!dir.exists(rd)) dir.create(rd)
if (!dir.exists(dd)) dir.create(dd)
l <- "gs://gnomad-public/release/2.0.2/vcf/"
e1 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz")
e2 <- file.path(rd, "gnomad.exomes.r2.0.2.sites.vcf.bgz.tbi")
g1 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz")
g2 <- file.path(rd, "gnomad.genomes.r2.0.2.sites.chr2.vcf.bgz.tbi")
getFile <- function(f, l) {
if (file.exists(f)) return(TRUE)
sub <- if (grepl("genomes", f)) "genomes" else "exomes"
s <- sprintf("gsutil -m cp %s%s/%s %s", l, sub, basename(f), f)
system(s)
TRUE
}
getFile(e1, l)
getFile(e2, l)
getFile(g1, l)
getFile(g2, l)
# tpo GRCh37 coordinates (https://www.ncbi.nlm.nih.gov/gene/7173)
tpo <- "2:1417233-1547445"
## Subset down to allele and genotype count fields
infoFlds <- rownames(scanBcfHeader(e1)[[1]]$Header$INFO)
keep <- infoFlds[grep("^AC|^AN|^GC", infoFlds)]
keep <- keep[!grepl("POPMAX|Male|Female", keep)]
keepStr <- paste(paste0("INFO/", c(keep, "AS_FilterStatus")), collapse = ",")
e3 <- sub("r2.0.2.sites", "tpo", e1)
g3 <- sub("r2.0.2.sites.chr2", "tpo", g1)
fmt1 <- "bcftools annotate -r %s -x ^%s %s | bgzip -c > %s"
system(sprintf(fmt1, tpo, keepStr, e1, e3))
system(sprintf(fmt1, tpo, keepStr, g1, g3))
system(sprintf("bcftools index -t %s", e3))
system(sprintf("bcftools index -t %s", g3))
mrg <- paste(paste0(keep, ":sum"), collapse = ",")
c1 <- file.path(rd, "gnomad.tpo.vcf")
system(sprintf("bcftools merge -m all -i %s %s %s -o %s", mrg, e3, g3, c1))
The following loads the VCFs, and displays the INFO fields contained in the files. The genome data does not contain any South Asian samples.
eDat <- scanBcf(e3)
gDat <- scanBcf(g3)
cDat <- scanBcf(c1)
getInfoFlds <- function(dat) {
sapply(strsplit(strsplit(dat$INFO[[1]], ";")[[1]], "="), "[[", 1)
}
getInfoFlds(eDat)
# [1] "AC" "AN" "AC_AFR" "AC_AMR"
# [5] "AC_ASJ" "AC_EAS" "AC_FIN" "AC_NFE"
# [9] "AC_OTH" "AC_SAS" "AN_AFR" "AN_AMR"
# [13] "AN_ASJ" "AN_EAS" "AN_FIN" "AN_NFE"
# [17] "AN_OTH" "AN_SAS" "GC_AFR" "GC_AMR"
# [21] "GC_ASJ" "GC_EAS" "GC_FIN" "GC_NFE"
# [25] "GC_OTH" "GC_SAS" "AC_raw" "AN_raw"
# [29] "GC_raw" "GC" "AS_FilterStatus"
getInfoFlds(gDat)
# [1] "AC" "AN" "GC_raw" "AC_raw"
# [5] "AN_raw" "GC" "AC_AFR" "AC_AMR"
# [9] "AC_ASJ" "AC_EAS" "AC_FIN" "AC_NFE"
# [13] "AC_OTH" "AN_AFR" "AN_AMR" "AN_ASJ"
# [17] "AN_EAS" "AN_FIN" "AN_NFE" "AN_OTH"
# [21] "AS_FilterStatus" "GC_AFR" "GC_AMR" "GC_ASJ"
# [25] "GC_EAS" "GC_FIN" "GC_NFE" "GC_OTH"
getInfoFlds(cDat)
# [1] "AC_AFR" "AC_AMR" "AC_ASJ" "AC_EAS"
# [5] "AC_FIN" "AC_NFE" "AC_OTH" "AC_raw"
# [9] "AN_AFR" "AN_AMR" "AN_ASJ" "AN_EAS"
# [13] "AN_FIN" "AN_NFE" "AN_OTH" "AN_raw"
# [17] "GC" "GC_AFR" "GC_AMR" "GC_ASJ"
# [21] "GC_EAS" "GC_FIN" "GC_NFE" "GC_OTH"
# [25] "GC_raw" "AS_FilterStatus"
We can also show that each entry in the input VCFs have the AC field:
system(sprintf("bcftools view %s | grep AC= | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep -v '#' | wc -l", e3))
# 1609
system(sprintf("bcftools view %s | grep AC= | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep -v '#' | wc -l", g3))
# 16108
system(sprintf("bcftools view %s | grep AC= | wc -l", c1))
# 0
The South Asian (*_SAS
) fields are also being lost, which is problematic:
setdiff(getInfoFlds(eDat), getInfoFlds(gDat))
# [1] "AC_SAS" "AN_SAS" "GC_SAS"
setdiff(getInfoFlds(eDat), getInfoFlds(cDat))
# [1] "AC" "AN" "AC_SAS" "AN_SAS" "GC_SAS"
It makes more sense they are lost, because they are not present in the genome set. However, I still need to pull that information along.
What happens if you omit the
AC:sum,AN:sum
part? Try it on the first 10 VCF entries only, so you can eyeball the results' veracity.They are still omitted if I remove
AC:sum,AN:sum
. From the documentation it seems like only DP and DP4 have default behavior. It is odd that they are completely dropped though, rather than taking the value from the first file.Can you edit your question and add the version of
bcftools
andhtslib
you're using please? Just paste the output ofbcftools -v
Done. Happy to upload the entire script, but it requires downloading the (somewhat) large gnomAD VCFs.
No, don't worry about it. I've worked a LOT on bcftools + gnomAD VCF files, so I am familiar with the setting.
You mention gnomAD genome and exome datasets, so I wonder if these are different samples (appropriate for
bcftools merge
) or the same set of samples (need a different tool).Each file, in my understanding, has one "sample" -- which I am trying to merge together to get allele frequencies combining the genome and exome data (similar to what gnomAD displays for the browser). For example, here you can toggle including genome and exome data, which just sums over the allele count and allele number.
And the files
e3
andg3
have theAC
andAN
fields? Just so we're suremerge
is the one removing them.Yes. I added some clarifying output to the post.
I'm kinda busy so I'm unable to pay this as much attention as I'd like to. Wild suggestion: Try running bcftools from the command line, and quote all input arguments, especially the one passed to
-i
.No luck. I appreciate your attempts to help. I also tried manually adding the
SAS
fields to the genome VCF before the merge, with no success. I thought it might exclude the "overall" values when there were subset values missing.Maybe give CombineVariants a shot?
Removing the
_SAS
fields did not fix the problem. I thought maybe theAN
andAC
were removed because theSAS
fields didn't carry over from the exome file.