14 months ago
MaeBH
Hi everyone, very new to bioinformatics
I have a SNP dataset (.vcf) that I tried to make a PCA plot from in RStudio (vcfR package), and it gave me interesting clustering. I then filtered the dataset with vcftools in the Linux terminal (missing data: 0.80; MAF = 0.05), and the PCA still has weird straight arms and a low % variance explained.
Just wondering a couple things:
- what might be causing this?
- how to get rid of it?
- would this affect downstream analysis?
Any help would be greatly appreciated
Could you describe your data in more detail? How many variants would you be left with if you removed every site that has missing genotypes?
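If it helps, here is a minimal Python sketch of one way to get that number: it counts how many variant records have a called genotype for every sample. It assumes an uncompressed, tab-delimited VCF with GT as the first FORMAT subfield; the function names and the demo records are made up for illustration and are not from the original data.

```python
def is_missing(sample_field):
    """Return True if the GT subfield contains a missing allele ('.')."""
    gt = sample_field.split(":", 1)[0]          # assumes GT is listed first
    alleles = gt.replace("|", "/").split("/")   # treat phased and unphased alike
    return "." in alleles


def count_sites(vcf_lines):
    """Return (total_sites, sites_with_no_missing_genotypes)."""
    total = complete = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue                            # skip blank and header lines
        samples = line.rstrip("\n").split("\t")[9:]
        total += 1
        if not any(is_missing(s) for s in samples):
            complete += 1
    return total, complete


# Hypothetical three-site, two-sample example
fixed = ["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT:DP"]
demo = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2"]),
    "\t".join(fixed + ["0/1:12", "./.:0"]),
    "\t".join(fixed + ["0/0:15", "1/1:20"]),
    "\t".join(fixed + ["0|1:11", ".|.:0"]),
]
total, complete = count_sites(demo)
print(f"{complete} of {total} sites have no missing genotypes")
```

For a real file you would pass an open file handle instead of the demo list, e.g. `count_sites(open("your_file.vcf"))`, where the filename is a placeholder.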
Is this sequencing data? In that case you should also filter for read depths and genotype qualities.
Hi thank you both for your responses :)
Yes, this is sequencing data (de novo ddRADseq). The data is from diploid plant material sampled across a landscape. Each population is a family comprised of a maternal plant and her progeny
Initial attributes: (shown in PCA above)
Have now filtered the data with vcftools using the following criteria: (missing data: 0.80; MAF: 0.5; minGQ: 0.9; minDP: 10)
It seems like an improvement but variant count dropped drastically :? What is making it drop so much?
I feel like it may be the percent of missingness within many of the samples being pretty high. Haven't been able to make a PCA with the reduced number of samples (it's now saying the vector numbers are incorrect)
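A quick way to check the per-sample missingness idea is to compute, for each sample, the fraction of sites with a missing genotype. The sketch below is plain Python under the same assumptions as any simple VCF parser (uncompressed, tab-delimited, GT as the first FORMAT subfield); the function name and the demo records are illustrative only. vcftools can produce a similar per-individual report with `--missing-indv`.

```python
def sample_missingness(vcf_lines):
    """Return {sample_name: fraction of sites with a missing GT}."""
    samples, missing, n_sites = [], [], 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                              # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                  # sample names follow FORMAT
            missing = [0] * len(samples)
            continue
        n_sites += 1
        for i, field in enumerate(fields[9:]):
            gt = field.split(":", 1)[0]           # assumes GT is listed first
            if "." in gt.replace("|", "/").split("/"):
                missing[i] += 1
    if n_sites == 0:
        return {name: 0.0 for name in samples}
    return {name: m / n_sites for name, m in zip(samples, missing)}


# Hypothetical three-site, two-sample example
fixed = ["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT"]
demo = [
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2"]),
    "\t".join(fixed + ["0/1", "./."]),
    "\t".join(fixed + ["0/0", "./."]),
    "\t".join(fixed + ["./.", "1/1"]),
]
for name, frac in sample_missingness(demo).items():
    print(f"{name}\t{frac:.2f}")
```

If samples get dropped based on this, any population label vector used to colour the PCA would presumably need to be subset to the same samples; that may or may not be what the vector error is about.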
I agree with the other commenter: we need more information, especially a rough idea of the number of variants before and after filtering. Also, what is the relationship between the populations? What is the sequencing depth?
Given that most SNP tools only emit information about variant sites, this could just indicate that each population carries some unique variation. However, given the low PC inertia, I would guess that most variable sites are either quite variable both within and among populations, or that there are many variants unique to only a few individuals (and those individuals are not in the same population, since that would increase PC inertia).
What would be the best way to test how variable the sites are and/or the number of unique variants? Sorry if the questions are really basic, still trying to wrap my head around everything
I can't think of any tools off the top of my head, but I'm sure they exist. If you were to do it manually, you could parse the VCF's genotype field and annotate, for each variant, which populations carry it and in how many individuals. With a decent grasp of R and knowledge of what the GT field denotes, you shouldn't have a problem doing this. I haven't used ddRADseq, so I don't know whether it's appropriate, but a structure barplot would be another way of visualizing the distribution of SNPs. See this tutorial.
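To make the manual approach concrete, here is a rough sketch in Python rather than R (the same logic would carry over). For each variant it counts how many individuals in each population carry a non-reference allele, then picks out variants seen in only one population. The `sample_to_pop` mapping, the function name, and the demo records are all hypothetical; it assumes an uncompressed, tab-delimited VCF with GT as the first FORMAT subfield.

```python
from collections import defaultdict


def carriers_by_population(vcf_lines, sample_to_pop):
    """Return [(chrom, pos, {population: n_carriers})] for each variant."""
    samples, results = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                                # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        counts = defaultdict(int)
        for name, field in zip(samples, fields[9:]):
            gt = field.split(":", 1)[0].replace("|", "/")
            # a carrier has at least one called, non-reference allele
            if any(a not in (".", "0") for a in gt.split("/")):
                counts[sample_to_pop[name]] += 1
        results.append((fields[0], int(fields[1]), dict(counts)))
    return results


# Hypothetical two-population example
sample_to_pop = {"A1": "popA", "A2": "popA", "B1": "popB", "B2": "popB"}
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "A1", "A2", "B1", "B2"])
rec1 = "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT",
                  "0/1", "1/1", "0/0", "./."])
rec2 = "\t".join(["chr1", "200", ".", "C", "T", ".", "PASS", ".", "GT",
                  "0/1", "0/0", "0/1", "0/0"])
table = carriers_by_population([header, rec1, rec2], sample_to_pop)

# Variants whose alternate allele appears in exactly one population
private = [(chrom, pos) for chrom, pos, counts in table if len(counts) == 1]
print(table)
print(private)
```

Note that a variant can look private to one population simply because genotypes are missing elsewhere, so it is worth reading this alongside the missingness numbers.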
Overall, I think your data generally looks okay. Hard to tell with the colours, but your populations are generally clustering together.