I have identified SNPs in 32 resequenced samples relative to the same reference genome. The calls were made with mpileup and the output is in VCF format. I would like to efficiently remove SNPs that are present in all 32 samples, as these most likely reflect differences between the reference and the resequenced samples as a whole rather than differences between the samples.
I can work out a slow and probably inefficient method in Perl, but I was wondering whether anyone has tackled this sort of question, or even whether such a task could be accomplished within samtools?
CLARIFICATION: The 32 samples are separate strains, i.e. each may be expected to contain novel mutations relative to the reference. They are not 32 samples of the same strain, so when mpileup is run with all samples together I end up with very few SNPs. This is why I ran mpileup one sample at a time: I want to find novel SNPs between strains.
Thanks.
Just out of curiosity, you have one VCF file containing all the results, right? Or do you have one file per sample? I don't think I understood correctly how your resequencing experiment worked.
There were 32 separate VCF files. I have now combined them all together, as suggested by lh3.
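For the one-file-per-sample case, here is a minimal Python sketch of the filtering described in the question. It is not a samtools feature and not necessarily what anyone in this thread used. It assumes plain-text, tab-separated VCF records, and it treats a SNP as "the same" only if CHROM, POS, REF and ALT all match exactly. The function names (`site_keys`, `shared_by_all`, `drop_shared`) are made up for illustration. Note that it ignores genotype and coverage, so a site missing from one file because of low depth will not be counted as shared.

```python
from collections import Counter


def site_keys(vcf_lines):
    """Yield (CHROM, POS, REF, ALT) for each record, skipping headers and blank lines."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield (chrom, int(pos), ref, alt)


def shared_by_all(per_sample_lines):
    """Return the set of sites that occur in every sample's VCF."""
    counts = Counter()
    n_samples = 0
    for lines in per_sample_lines:
        n_samples += 1
        # set() so a site repeated within one file is counted once per sample
        counts.update(set(site_keys(lines)))
    return {site for site, c in counts.items() if c == n_samples}


def drop_shared(vcf_lines, shared):
    """Yield header lines unchanged and only those records whose site is not in `shared`."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            yield line
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if (chrom, int(pos), ref, alt) not in shared:
            yield line
```

To use it on real files, each file would be read twice: once to build the shared set (for example by passing open file handles to `shared_by_all`), and once more through `drop_shared` to write the filtered copy. Memory use grows with the number of distinct sites, which should be fine for SNP-sized call sets.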