Question

Comparing VCF files between two groups (15 vcf files against 15 vcf files)

2

5.9 years ago

Pin.Bioinf ▴ 340

Hello,

I have 15 VCF files for one population and 15 VCF files for another. I want to find both the differences and the similarities between the two groups: what changes from one group to the other, what stays the same, and, if possible, a significance score.

I have read about PLINK, but I am not sure what the pipeline should look like. Which steps should I follow? I read the documentation and it is not clear to me.

I also read about bcftools isec, which intersects multiple VCF files. Could I merge the 15 VCF files of each group separately, ending up with two files (population1_variants.vcf and population2_variants.vcf), and then compare those two against each other to find the differences and similarities?

Which approach is better? Is this how people usually analyze variants among populations? How can I assess the significance of the results? Are there any other approaches?

Thank you

vcf SNP variants PLINK • 3.1k views

score 2 · Accepted Answer · 2018-12-21

2

5.9 years ago

Raony Guimarães ★ 1.4k

It really depends on what you want to achieve with this comparison. You could merge all the VCFs and run an association analysis with PLINK to find variants that differ between the two groups, or you could run a PCA on all samples to see whether the two populations separate clearly.

Try doing an association analysis:

plink --file mydata --assoc

Then look for SNPs whose allele frequencies differ significantly between the two groups.

http://zzz.bwh.harvard.edu/plink/anal.shtml
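
A minimal sketch of how the merge and association steps could fit together, not a definitive pipeline. It assumes single-sample, bgzip-compressed and indexed VCFs, bcftools, and PLINK 1.9 (which reads VCF directly through --vcf, so it uses --bfile rather than the --file shown above). All file names (pop1_*.vcf.gz, pop2_*.vcf.gz, controls.txt, cases.txt, pheno.txt, merged, assoc_out) and the sample IDs are hypothetical. The phenotype file is what tells PLINK which samples belong to which group.

```shell
#!/bin/sh
# Sketch only: every file name and sample ID here is made up for illustration.

# Write a PLINK phenotype file (columns: FID IID PHENO; 1 = control, 2 = case)
# from two lists of sample IDs, one ID per line.
make_pheno() {
    # $1 = list of control sample IDs, $2 = list of case sample IDs
    awk '{ print $1, $1, 1 }' "$1"
    awk '{ print $1, $1, 2 }' "$2"
}

# Tiny demo of the phenotype format with invented sample IDs.
printf 'ctrl_01\nctrl_02\n' > controls.txt
printf 'case_01\ncase_02\n' > cases.txt
make_pheno controls.txt cases.txt > pheno.txt

# The remaining steps need bcftools, PLINK 1.9 and the real VCFs,
# so they only run when all of those are present.
if command -v bcftools >/dev/null 2>&1 && command -v plink >/dev/null 2>&1 \
   && ls pop1_*.vcf.gz pop2_*.vcf.gz >/dev/null 2>&1; then
    # 1. Merge all 30 single-sample VCFs into one multi-sample VCF.
    bcftools merge -Oz -o merged.vcf.gz pop1_*.vcf.gz pop2_*.vcf.gz

    # 2. Convert to PLINK binary format; --double-id uses each VCF
    #    sample ID as both family ID and individual ID.
    plink --vcf merged.vcf.gz --double-id --make-bed --out merged

    # 3. Case/control association; Fisher's exact test is chosen here
    #    because the groups are small (15 vs 15).
    plink --bfile merged --pheno pheno.txt --allow-no-sex \
          --assoc fisher --out assoc_out
fi
```

Note that bcftools merge keeps every site seen in any sample and sets the genotype to missing for samples that have no record there. Whether those should instead be treated as homozygous reference (bcftools merge --missing-to-ref) depends on how the VCFs were called, so that choice is left open here.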

0

Thank you! This seems like a nice approach, and it is what I was looking for. Would the mydata input be the merge of 15samplescase.vcf and 15samplescontrol.vcf? And should each of those merged VCFs contain only the variants common to all 15 samples?

Thank you

5.9 years ago by Pin.Bioinf ▴ 340