I have a VCF (and a Beagle file) containing genotype likelihoods. I was wondering how to calculate the breadth of coverage ("the percentage of target bases that have been sequenced for a given number of times", i.e. how much of the reference genome was sequenced in each individual) for all the individuals across my reference genome.
See https://www.nature.com/articles/nrg3642
This one-liner gives the name and size of each contig in the reference genome:
zcat ref.genome.fna.gz | awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' > chr_size.txt
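Alternatively, if you have a samtools .fai index for the reference (from samtools faidx), its first two columns are exactly this name/size table and can be read straight into R. A minimal sketch, assuming an index file ref.genome.fna.fai exists:
# a .fai index is tab-separated: name, length, offset, linebases, linewidth
fai = read.table("ref.genome.fna.fai", sep = "\t",
                 col.names = c("chr", "size", "offset", "linebases", "linewidth"))
chr.size.df = fai[, c("chr", "size")] # same table as built from chr_size.txt below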
Reading this data in R and preparing it for analysis.
chr.size = readLines("chr_size.txt")
seq.od = seq(1, length(chr.size), by = 2) # odd lines: ">name" headers
seq.ev = seq(2, length(chr.size), by = 2) # even lines: sizes
chr.size.df = data.frame(chr = chr.size[seq.od], size = chr.size[seq.ev])
chr.size.df$chr = sub("^>", "", chr.size.df$chr) # drop the FASTA ">" so names match the VCF CHROM column, for merging later
chr.size.df$size = as.numeric(chr.size.df$size)  # sizes as numbers, for merging and plotting
This gives me a table of all the contig lengths, against which I can check whether the VCF contains a position for an individual:
vcftools --gzvcf samples.vcf.gz --extract-FORMAT-info GL --out samples
Note: you could filter with the "--chr" flag to make the file smaller for testing purposes.
This produces a file, samples.GL.FORMAT, with only the genotype likelihoods.
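To check the layout before loading everything into memory (a header line, then one row per site with one GL entry or "." per sample), you can peek at the first few lines:
readLines("samples.GL.FORMAT.gz", n = 3) # R decompresses .gz transparently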
Reading that file into R and recoding it so that a position with a genotype likelihood becomes 1 and a missing one (".") becomes 0:
gl.format = read.table("samples.GL.FORMAT.gz", header = TRUE, comment.char = "")
colnames(gl.format)[1:2] = c("CHROM", "POS") # normalise names (the header may come through as "X.CHROM")
bd.gl = gl.format[, -c(1:2)]                 # sample columns only
bd.gl.01 = apply(bd.gl, 2, function(x) ifelse(x == ".", 0, 1)) # 1 = GL present, 0 = missing
bd.gl.01.chr.pos = cbind(gl.format[, c(1:2)], bd.gl.01)
From there, there should be a way to get the breadth of coverage.
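As a side note, if you only need the per-individual numbers and not the per-position plot, the 0/1 matrix can be summarised directly without any reshaping. A minimal sketch, assuming the objects built above:
# count covered sites per individual within each chromosome,
# then divide by that chromosome's length
counts = rowsum(bd.gl.01, group = gl.format$CHROM)
sizes = chr.size.df$size[match(rownames(counts), chr.size.df$chr)]
breadth.direct = counts / sizes # rows = chromosomes, columns = individuals
The reshape below keeps the per-position detail, which is what the plot needs.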
I reshaped the data so it is easier to work with: instead of having individuals in columns, I stacked them into a single column, together with a presence/absence ("pa") indicator for the genotype likelihood.
library(dplyr)
library(tidyr)

bd.gl.long = bd.gl.01.chr.pos %>%
  pivot_longer(cols = -c("CHROM", "POS"),
               names_to = "ID",
               values_to = "pa")
Keeping only the sites that are present:
bd.gl.long.pa = bd.gl.long[bd.gl.long$pa == 1,]
nrow(bd.gl.long.pa)
bd.gl.long.pa$CHROM = as.factor(bd.gl.long.pa$CHROM)
Since some chromosomes or contigs are too large to plot all at once, I subset a single chromosome here.
chr.subset = "chr1"
bd.chrsub = bd.gl.long.pa[bd.gl.long.pa$CHROM == chr.subset,]
Merging the two datasets by chromosome, so every position carries its contig size:
all.pos = merge(bd.chrsub, chr.size.df, by.x = "CHROM", by.y = "chr", all.x = TRUE)
Plotting the data:
plot(0, type = "n", ylim = c(0, 1), yaxt = "n", ylab = "",
     xlab = "Position (bp)",
     main = paste("Breadth of coverage for", chr.subset),
     xlim = c(0, chr.size.df[chr.size.df$chr == chr.subset, "size"]))
text(0, .55, paste("Chromosome", chr.subset), pos = 4)
segments(x0 = 0, # showing the FULL chromosome length
         y0 = .52,
         x1 = chr.size.df[chr.size.df$chr == chr.subset, "size"],
         y1 = .52,
         lend = 1, lwd = 5)
id.names = unique(all.pos$ID)
pos.y = 1:length(id.names)/length(id.names)/2 # evenly spaced y positions in (0, 0.5]
for (k in 1:length(id.names)) { # add the points for all the individuals one by one.
points(x = all.pos[all.pos$ID == id.names[k],"POS"],
y = rep(pos.y[k],nrow(all.pos[all.pos$ID == id.names[k],])),
pch = ".",cex = 1)
}
Making tick marks to better situate the individuals:
seq.samples = pos.y[seq(10, length(id.names), by = 10)] # a tick at every 10th individual
points(x = rep(0, length(seq.samples)),
y = seq.samples,
pch = "-",cex = 1)
This gives a visual representation of the breadth of coverage. Now these numbers can be used to get details about how much of each contig was found in each individual.
chr.size.sub = chr.size.df[chr.size.df$chr == chr.subset, "size"]
breadth.cov = all.pos %>%
  group_by(ID) %>%
  summarise(sum.pa = sum(pa), # number of covered sites
            percentage.breadth = sum.pa / chr.size.sub) # proportion of the chromosome covered
hist(breadth.cov$percentage.breadth)
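If you want this for every contig at once rather than one subset at a time, the same summarise generalises by also grouping on chromosome. A sketch, assuming you merge the full bd.gl.long.pa (instead of bd.chrsub) with chr.size.df in the same way as above:
all.pos.full = merge(bd.gl.long.pa, chr.size.df,
                     by.x = "CHROM", by.y = "chr", all.x = TRUE)
breadth.cov.all = all.pos.full %>%
  group_by(CHROM, ID) %>%
  summarise(sum.pa = sum(pa),
            percentage.breadth = sum.pa / first(size),
            .groups = "drop")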
I think this is working, but I would like a second opinion or a cleaner/easier solution.
That'd be cool! The only thing is that my VCF is >210 GB, so would R be able to run that?
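Probably not in one go: read.table would try to hold the whole table in memory. Two hedged options are to split the extraction by chromosome with vcftools --chr (as noted above), or to stream the GL.FORMAT file in chunks and keep only running counts. A minimal chunked-reading sketch, assuming the samples.GL.FORMAT.gz file from above (the chunk size of 1e6 rows is an arbitrary choice):
# stream the file in fixed-size chunks, keeping only running
# per-individual counts of non-missing sites (never the full table)
con = gzfile("samples.GL.FORMAT.gz", open = "r")
header = strsplit(readLines(con, n = 1), "\t")[[1]]
covered = setNames(numeric(length(header) - 2), header[-c(1, 2)])
repeat {
  chunk = tryCatch(
    read.table(con, nrows = 1e6, sep = "\t", col.names = header, comment.char = ""),
    error = function(e) NULL) # NULL once the connection is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  covered = covered + colSums(chunk[, -c(1:2)] != ".")
}
close(con)
# covered divided by the genome (or per-contig) size then gives the breadth per individual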