Hi there,
I'm having trouble finding which SNPs have been pruned after LD pruning. I want to be able to list those loci that are being removed so I can remove them from my dataset for future analyses. I'll be honest that I am very new to this stuff!
Here is my code (forgive me for any errors!):
library(gdsfmt)
library(SNPRelate)
### File Conversion - VCF to GDS ###
vcf.fn <- "H:PhD Research/UNEAK_Bat_West1+3+4/Filtered Data/CORA 50%.vcf" snpgdsVCF2GDS(vcf.fn, "CORA.gds", method="biallelic.only")
snpgdsSummary("CORA.gds")
setwd("H:/PhD Research/UNEAK_Bat_West1+3+4/SNPRelate/")
### Clean-up Linkage Disequilibrium ###
CORA.gds <- snpgdsOpen("CORA.gds", readonly=FALSE, allow.duplicate=TRUE, allow.fork=FALSE)
set.seed(1000)
snpset <- snpgdsLDpruning(CORA.gds, ld.threshold=0.8)
names(snpset) head(snpset$chr0)
snp.id <- unlist(snpset)
read.gdsn(index.gdsn(CORA.gds, "snpset"))
I'm trying to read the GDS field in the last line but can't seem to get an output. Instead I get an error:
Error in inherits(node, "gdsn.class") : No such GDS node "snpset"!
I'm also trying to calculate HWE equilibrium and have come across very little documentation on how to do this, but here is my code:
### Clean-up HWE ###
CORA.gds <- snpgdsOpen("CORA.gds", readonly=FALSE, allow.duplicate=FALSE, allow.fork=FALSE)
sample.id <- read.gdsn(index.gdsn(CORA.gds, "sample.id"))
p <- snpgdsHWE(CORA.gds, sample.id=NULL, snp.id=NULL, with.id=TRUE)
summary(p)
plot(-log10((1:length(p))/length(p)), -log10(p[order(p)]), xlab="-log10(expected P)", ylab="-log10(observed P)", main="QQ Plot")
Any advice at all on any of these aspects would be much appreciated!
Hey, thank you for re-posting here. Just for the LD part, I think that you may in fact want to select
ld.threshold=0.2
- 0.8 seems too high? This may not change anything, though.How did you build this tutorial so far? - following this: http://corearray.sourceforge.net/tutorials/SNPRelate/ ?
The packages loaded correctly?
Have you performed any pre-processing on your VCF?; Did both
snpgdsVCF2GDS
andsnpgdsSummary
run okay?Hi Kevin,
Thanks for your response. Yes, I was following the tutorial you mention to build my code, but it doesn't mention how to visualize the markers once they have been pruned.
All of the packages loaded correctly and all of the functions worked. It tells me I have 2,745 markers selected from my original 3,569, but I'm just not sure how to visualize them.How do I know which markers have been pruned so I can remove them from my dataset and use the cleaned set for future analyses? Is there a way to write the .gds to a .csv or something similar? Also, I did not perform any pre-processing on my VCF. I took it just as it is and converted it to .gds. Let me know your thoughts.
Why not just get the difference between the list of 3569 markers and that of 2745? I have not used SNPRelate. I use PLINK (recommended by 15993216250) for this type of analysis.
Yes, I am trying to get the difference between the two lists but I'm not sure how to visualize that. How do I see which markers were removed from the list of 3569 to make it 2745? How can I get an output? If I can't figure out SNPRelate, I am going to give PLINK a try. Thank you for your input!