25 days ago
SIMONE
Dear all,
I am working on 500 WES samples from two different sources and would like to check whether there is any batch effect. I am trying to perform PCA because I think it is the easiest way to check, but I don't know which data I should use. Should I use quality data from FastQC or the BAM files, or should I use data from the VCF file? Is there an R package or something else I can use? I have used VarScan2 to call SNPs.
Thanks in advance.
You should convert the VCF data to genotypes (0/1/2) for each SNV. Then remove sites where all samples have the same genotype, and feed that data into princomp() (or prcomp(), which also works when there are more SNVs than samples). When plotting the PCA, you can colour the points by batch. Don't just look at the top 2 PCs. Use a scree plot to see how many PCs you need to capture 90-95% of the variation among the samples. It may be that you need to look at 5-10 PCs. Unfortunately I am not aware of a package that can do all of these steps.
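The first two steps and the scree-plot cutoff can be sketched in a few lines. This is only a minimal illustration in plain Python, not a replacement for the R workflow above: the function names (`gt_to_dosage`, `drop_invariant_sites`, `pcs_needed`) are hypothetical, the genotypes and variances are made-up toy values, and it assumes biallelic diploid calls with GT strings already extracted from the VCF.

```python
def gt_to_dosage(gt):
    """Convert a VCF GT string such as '0/1' or '1|1' to an alternate-allele count.

    Returns None for missing calls ('./.') or anything that is not a
    biallelic diploid genotype.
    """
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None
    return sum(int(a) for a in alleles)


def drop_invariant_sites(matrix):
    """Keep only sites (rows) where the called samples do not all share one genotype."""
    kept = []
    for row in matrix:
        called = {g for g in row if g is not None}
        if len(called) > 1:
            kept.append(row)
    return kept


def pcs_needed(variances, target=0.90):
    """Smallest number of leading PCs whose cumulative variance share reaches `target`."""
    total = sum(variances)
    running = 0.0
    for k, v in enumerate(variances, start=1):
        running += v
        if running / total >= target:
            return k
    return len(variances)


# Toy data: rows are sites, columns are samples (values are made up).
gts = [
    ["0/0", "0/1", "1/1", "0/1"],
    ["0/0", "0/0", "0/0", "0/0"],  # invariant site, will be dropped
    ["1|1", "./.", "0/1", "0/0"],
]
dosages = [[gt_to_dosage(g) for g in row] for row in gts]
informative = drop_invariant_sites(dosages)

# Per-PC variances as they would come out of a PCA (made-up numbers).
n_pcs = pcs_needed([6.0, 2.5, 1.0, 0.5], target=0.90)
```

Missing calls are kept as `None` here; they would need to be imputed or the site dropped before running the PCA itself.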
Hi,
Thanks for the comment. I have another question: would it make sense to check some quality metrics obtained through CollectHsMetrics or CollectAlignmentSummaryMetrics?
Best.
Collecting QC information is always a good idea. FastQC can tell you some things about the sequencing run and library, but other metrics, such as the percentage of unique reads and the percentage of reads mapped to the genome, can only be obtained from the BAM files.
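One way to use such metrics for the batch question is to pull a single column out of each sample's Picard metrics file and compare it between the two sources. The sketch below assumes the usual Picard layout (a `## METRICS CLASS` line followed by a tab-separated header row and data rows). The helper names and the abridged example file are hypothetical, and the numbers are made up.

```python
import csv
import io
from statistics import mean


def read_picard_metrics(text):
    """Parse the metrics table from the text of a Picard metrics file.

    Assumes the table starts on the line after '## METRICS CLASS'
    and ends at the next blank line.
    """
    lines = text.splitlines()
    start = next(i for i, ln in enumerate(lines) if ln.startswith("## METRICS CLASS"))
    table = []
    for ln in lines[start + 1:]:
        if not ln.strip():
            break
        table.append(ln)
    return list(csv.DictReader(io.StringIO("\n".join(table)), delimiter="\t"))


def mean_by_batch(metric_by_sample, batch_by_sample):
    """Average one metric per batch, given sample -> value and sample -> batch maps."""
    groups = {}
    for sample, value in metric_by_sample.items():
        groups.setdefault(batch_by_sample[sample], []).append(value)
    return {batch: mean(values) for batch, values in groups.items()}


# Abridged, made-up example of a CollectAlignmentSummaryMetrics output.
example = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "\n"
    "## METRICS CLASS\tpicard.analysis.AlignmentSummaryMetrics\n"
    "CATEGORY\tTOTAL_READS\tPCT_PF_READS_ALIGNED\n"
    "PAIR\t1000\t0.98\n"
    "\n"
)
rows = read_picard_metrics(example)
pct_aligned = float(rows[0]["PCT_PF_READS_ALIGNED"])

# Made-up per-sample values and batch labels.
summary = mean_by_batch(
    {"s1": 0.98, "s2": 0.96, "s3": 0.90},
    {"s1": "A", "s2": "A", "s3": "B"},
)
```

A clear difference in such per-batch averages would be a hint that the two sources differ technically, independently of what the genotype PCA shows.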