Hi all,
I have a VCF file containing variant data for 52 samples.
- sample 1
- sample 2
- sample 3
- etc., etc.
What I would like to do is perform pairwise comparisons where I count the number of variants (SNPs and small INDELs) between each sample and each other sample.
- number of variants between sample 1 and sample 2
- number of variants between sample 1 and sample 3
- and so on for every pairwise comparison possible.
I'm not looking to count the number of variants across all samples, nor the number of variants between each sample and the reference assembly, as I already have these.
I had been hoping that VCFtools would have a function for this, but from checking the manual, it seems not? If I have missed something in VCFtools, please let me know. Otherwise, I would really appreciate links to Python, Perl or Bash scripts that can do what I need, or recommendations for other software that might help.
Many thanks in advance.
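For anyone landing here with the same question, the pairwise count described above can be sketched in plain Python with no extra libraries. This is only a minimal sketch, not the tool suggested in this thread: the function names (`normalize_gt`, `pairwise_diff_counts`) and the example data are made up for illustration. It assumes an uncompressed, tab-delimited VCF with unique sample names, treats phased and unphased genotypes as equal (0|1 is the same as 1/0), and skips a site for a pair if either sample has a missing genotype. Those choices may not suit every dataset.

```python
from itertools import combinations


def normalize_gt(gt):
    """Return a sorted tuple of alleles, or None if any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return tuple(sorted(alleles))


def pairwise_diff_counts(vcf_lines):
    """Count, for each pair of samples, the sites where their genotypes differ."""
    samples = []
    counts = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            counts = {pair: 0 for pair in combinations(samples, 2)}
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype information on this record
        gt_idx = fmt.index("GT")
        gts = [normalize_gt(col.split(":")[gt_idx]) for col in fields[9:]]
        for (i, a), (j, b) in combinations(enumerate(gts), 2):
            # Assumption: a pair is only compared when both genotypes are called.
            if a is not None and b is not None and a != b:
                counts[(samples[i], samples[j])] += 1
    return counts


# Hypothetical three-sample VCF, used only to illustrate the function.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "scaf1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "scaf1\t250\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t1/0\t./.",
    "scaf2\t42\t.\tG\tGA\t.\tPASS\t.\tGT:DP\t1/1:12\t1/1:9\t0/0:7",
]

for pair, n in sorted(pairwise_diff_counts(EXAMPLE_VCF).items()):
    print(pair[0], pair[1], n)
```

For a real file, passing an open file handle instead of the list should work the same way, since the function only iterates over lines. With 52 samples that gives 1326 pairs per site, which is fine for moderate VCFs but may be slow on very large ones.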
Thanks very much for the suggestion, I have given it a try and it seems to work well!
Can I just confirm that the number output for each pair is the number of positions where the two samples undergoing comparison have the same genotype, so that more similar samples have higher numbers in the output?
Thanks again!
yes
Hello rc16955,
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Sorry to double post. You seem familiar with BioAlcidaeJdk, do you think it could be used for the next step in my analysis? I'd like to be able to see where in the genome variants within pairwise comparisons are falling. Would it be possible, using BioAlcidaeJdk, to get a list of each genomic position (i.e. scaffold + position on that scaffold) that is variant within a particular pairwise comparison? Something like
Obviously the exact format of the output wouldn't be important, as long as it contained the above information. I only include that to give an idea of what I mean.
a little bit
bad example, none of your genotypes is the same for each variant.
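If a plain script is acceptable instead of BioAlcidaeJdk, listing the positions that differ within one pairwise comparison could look roughly like the sketch below. This is Python, not BioAlcidaeJdk code, and the names (`normalize_gt`, `differing_positions`, `EXAMPLE_VCF`, samples `S1` to `S3`) are hypothetical. It makes the same assumptions as a simple genotype comparison usually does: uncompressed tab-delimited VCF, phasing ignored, and sites skipped when either of the two samples has a missing genotype.

```python
def normalize_gt(gt):
    """Return a sorted tuple of alleles, or None if any allele is missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return tuple(sorted(alleles))


def differing_positions(vcf_lines, sample_a, sample_b):
    """List (scaffold, position, GT of a, GT of b) where the two samples differ."""
    idx_a = idx_b = None
    hits = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            # Column indices of the two samples; raises ValueError if absent.
            idx_a = 9 + samples.index(sample_a)
            idx_b = 9 + samples.index(sample_b)
            continue
        if idx_a is None:
            raise ValueError("no #CHROM header line before the first record")
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype information on this record
        gt_idx = fmt.index("GT")
        gt_a = fields[idx_a].split(":")[gt_idx]
        gt_b = fields[idx_b].split(":")[gt_idx]
        norm_a, norm_b = normalize_gt(gt_a), normalize_gt(gt_b)
        # Assumption: sites with a missing genotype in either sample are skipped.
        if norm_a is not None and norm_b is not None and norm_a != norm_b:
            hits.append((fields[0], int(fields[1]), gt_a, gt_b))
    return hits


# Hypothetical three-sample VCF, used only to illustrate the function.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "scaf1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "scaf1\t250\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t1/0\t./.",
    "scaf2\t42\t.\tG\tGA\t.\tPASS\t.\tGT:DP\t1/1:12\t1/1:9\t0/0:7",
]

for scaffold, pos, gt_a, gt_b in differing_positions(EXAMPLE_VCF, "S1", "S3"):
    print(scaffold, pos, gt_a, gt_b, sep="\t")
```

The output is one tab-separated line per differing site (scaffold, position, and the two genotypes), which should be easy to reshape into whatever format is needed downstream.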