Hi everyone!
I followed the SNPRelate package tutorial in order to filter a VCF by LD and finally pursued PCA. My problem is that by converting the VCF into GDS SNP's and Chromosomes ID's are also converted into a different nomenclature.
sample.id, a unique identifier for each sample.
snp.id, a unique identifier for each SNP.
How can I get the original SNP ID's that are kept after LD pruning?
Thanks in advance.
setwd("~/Desktop/snprelate")
library(gdsfmt)
library(SNPRelate)
cap.vcf <- "all_148.vcf"
snpgdsVCF2GDS(cap.vcf, "cap.gds", method = "biallelic.only")
snpgdsSummary("cap.gds")
genofile <- snpgdsOpen("cap.gds")
pop_code <- scan("pop.txt", sep = "\t", what = list("pop.txt"))
pop_code
set.seed(1000)
cap.LD <- snpgdsLDpruning(genofile, remove.monosnp = TRUE, ld.threshold = .1, maf = .1, missing.rate = 0.1, method = "composite", slide.max.bp = 500000, verbose = TRUE)
snp_id_list <- unlist(cap.LD)
Hi there,
I'm also having trouble finding which SNPs have been pruned after LD pruning. I want to be able to list those loci that are being removed so I can remove them from my dataset for future analyses. I'll be honest that I am very new to this stuff!
Here is my code (forgive me for any errors!):
library(gdsfmt) library(SNPRelate)
File Conversion - VCF to GDS
vcf.fn <- "H:PhD Research/UNEAK_Bat_West1+3+4/Filtered Data/CORA 50%.vcf" snpgdsVCF2GDS(vcf.fn, "CORA.gds", method="biallelic.only") snpgdsSummary("CORA.gds")
setwd("H:/PhD Research/UNEAK_Bat_West1+3+4/SNPRelate/")
Clean-up Linkage Disequilibrium
CORA.gds <- snpgdsOpen("CORA.gds", readonly=FALSE, allow.duplicate=TRUE, allow.fork=FALSE) set.seed(1000) snpset <- snpgdsLDpruning(CORA.gds, ld.threshold=0.8) names(snpset) head(snpset$chr0) snp.id <- unlist(snpset) read.gdsn(index.gdsn(CORA.gds, "snpset"))
I'm trying to read the GDS field in the last line but can't seem to get an output. Instead I get an error: Error in inherits(node, "gdsn.class") : No such GDS node "snpset"!
I'm also trying to calculate HWE equilibrium and have come across very little documentation on how to do this, but here is my code:
Clean-up HWE
CORA.gds <- snpgdsOpen("CORA.gds", readonly=FALSE, allow.duplicate=FALSE, allow.fork=FALSE) sample.id <- read.gdsn(index.gdsn(CORA.gds, "sample.id")) p <- snpgdsHWE(CORA.gds, sample.id=NULL, snp.id=NULL, with.id=TRUE) summary(p) plot(-log10((1:length(p))/length(p)), -log10(p[order(p)]), xlab="-log10(expected P)", ylab="-log10(observed P)", main="QQ Plot")
Any advice at all on any of these aspects would be much appreciated!
Please post as a new question. In doing so, please highlight your code (with mouse) and select the
101 010
button to format it as code. This helps to improve visualisation.Done. Thank you for your help, Kevin.