Entering edit mode
7.6 years ago
kirannbishwa01
★
1.6k
Say I have two multi-sample vcf files from two different populations. I want to find the sites (CHROM, POS and alleles) in these vcf files that are fixed for different alleles, which could be REF vs. ALT
or ALT1 vs. ALT2
.
I was thinking about using pyVCF and comparing all samples to all samples (in Pop1 vs. 2), but it would turn out to be a quite a lengthy and frustrating task.
Any suggestions. Any one done this before or have any proposed solution.
Thanks,
Post Edit:
Lets say we have,
vcf_pop1 =
#CHROM POS FORMAT pop1_A pop1_B pop1_C (I am skipping other headers)
2 26 GT 0/0 0/1 1/2
2 37 GT 1/1 1/1 1/1
2 105 GT 0/0 0/1 0/2
2 206 GT 0/0 0/0 0/0
2 305 GT 0/0 0/1 1/2
2 517 GT 2/2 2/2 2/2
you can see that alleles at position 37, 206 and 517 are fixed in this population
vcf_pop2 =
#CHROM POS FORMAT pop1_A pop1_B pop1_C (I am skipping other headers)
2 29 GT 0/1 1/1 1/2
2 37 GT 0/0 0/0 0/0
2 105 GT 0/0 0/1 0/2
2 119 GT 0/1 1/2 1/2
2 218 GT 2/2 2/2 2/
2 305 GT 0/0 0/1 1/2
2 517 GT 1/1 1/1 1/1
Here alleles at position 37, 218 and 517 are fixed.
But, the two population are alternately fixed at position 37 and 517.
This is not clear to me, can you give an example?
Lets say we have,
But, the two population are alternately fixed at position 37 and 517.
Is it clear now ?
I see.
But aren't the populations also "alternatively fixed" for position 218? It's reference in pop1 (but not in the vcf) and variant in pop2.
@WouterD:
Yes, the allele at position 218 for pop1 could be
reference (0/0) or no call (./.)
. One funny thing about how several VCFcallers work is that: the reference 0/0 is only emited if at least one other sample (in the pool of samples that were called together) had alternate allele/genotype.Since, the original vcf was called using all the samples of both populations (supplying multiple bam files), and then later separated into different vcf by population. Then nocall
GT = ./.
were dropped in each population vcf. This suggests that 218 in pop1 wasnocall (./.)
. But, without explaining what I really did, I know its hard to tell what was what.But, is there a method to called the alternately fixed vcf given this data?
I was thinking if there was a tool to work out this problem (like inside GATK, FreeBayes, SAMTOOLs, etc. etc.) and if the community know about it, before I try to reinvent the wheel.
Btw, did you get the other comment I posted in your other question/answer.
Thanks,
I'm not aware of tool which does this directly (although it's definitely possible it exists).
However, I think writing your own python version wouldn't be too bad. If you would
I had this same thing my mind. But, I am too tired of writing small program for each small problem. I was hoping that such tool already existed, all I need is to find them and put it in my pipeline. I got some response from Perrie, hope it works out.
Thanks again,