My first thought: pooling 30 individuals into one sample is certainly economical, but it is a bad idea for data analysis. What good is saving 80% of the money by pooling if the remaining 20% is wasted because the data are not reliable?
Well, let's hope that does not happen.
I think the first question you have to answer is how to decide whether a variant has really been found in each pool. Is it a real variant (i.e. a SNP/indel) or an artifact? What does it mean if a variant has coverage < 30, i.e. lower than the number of individuals in the pool?
Some basic quality checks may even be needed before variant calling, to see what the raw data look like. Did you do this?
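To make the coverage question concrete, here is a rough back-of-the-envelope sketch. It assumes (my assumption, not something known about your data) that every individual contributes equally to the pool and that each read comes from an independently chosen individual. The function name is made up for illustration.

```python
def expected_individuals_sampled(n_individuals, depth):
    """Expected number of distinct individuals hit by `depth` reads.

    Assumption: each read comes from a uniformly random individual,
    independently of the other reads (equal contribution to the pool).
    """
    # Probability that one given individual is missed by every read
    p_missed = (1 - 1 / n_individuals) ** depth
    return n_individuals * (1 - p_missed)


# Pool of 30 individuals at a few example depths
for depth in (10, 30, 100):
    print(depth, round(expected_individuals_sampled(30, depth), 1))
```

Under this simple model, a depth of 30 reaches only about 19 of the 30 individuals on average, so a site with coverage < 30 cannot tell you much about alleles carried by the individuals that were missed. Real pools are usually less even than this, which makes things worse.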
Now suppose you have the best variants you can get, in VCF format. You can use VCFtools to answer your question:
http://vcftools.sourceforge.net/man_latest.html. Or, perhaps better suited to your case, vcf-compare: http://vcftools.sourceforge.net/perl_module.html#vcf-compare. The comparison options from the VCFtools manual are quoted here:
COMPARISON OPTIONS
These options are used to compare the original variant file to another variant file and output the results. All of the diff functions require both files to contain the same chromosomes and that the files be sorted in the same order. If one of the files contains chromosomes that the other file does not, use the --not-chr filter to remove them from the analysis.
DIFF VCF FILE
--diff <filename>
--gzdiff <filename>
--diff-bcf <filename>
These options compare the original input file to this specified VCF, gzipped VCF, or BCF file. These options must be specified with one additional option described below in order to specify what type of comparison is to be performed. See the examples section for typical usage.
DIFF OPTIONS
--diff-site
Outputs the sites that are common / unique to each file. The output file has the suffix ".diff.sites_in_files".
--diff-indv
Outputs the individuals that are common / unique to each file. The output file has the suffix ".diff.indv_in_files".
--diff-site-discordance
This option calculates discordance on a site by site basis. The resulting output file has the suffix ".diff.sites".
--diff-indv-discordance
This option calculates discordance on a per-individual basis. The resulting output file has the suffix ".diff.indv".
--diff-indv-map <filename>
This option allows the user to specify a mapping of individual IDs in the second file to those in the first file. The program expects the file to contain a tab-delimited line containing an individual’s name in file one followed by that same individual’s name in file two with one mapping per line.
--diff-discordance-matrix
This option calculates a discordance matrix. This option only works with bi-allelic loci with matching alleles that are present in both files. The resulting output file has the suffix ".diff.discordance.matrix".
--diff-switch-error
This option calculates phasing errors (specifically "switch errors"). This option creates an output file describing switch errors found between sites, with suffix ".diff.switch".
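To illustrate what a site-level comparison such as --diff-site reports conceptually (sites common to both files or unique to one), here is a minimal sketch in plain Python. It is not how VCFtools is implemented and does not reproduce its output format; the function names and the tiny records are made up, and sites are matched on chromosome and position only.

```python
def read_sites(vcf_lines):
    """Collect (CHROM, POS) keys from VCF text lines, skipping header lines."""
    sites = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split("\t")[:2]
        sites.add((chrom, int(pos)))
    return sites


def compare_sites(first_lines, second_lines):
    """Split sites into those shared by both inputs and those unique to each."""
    first = read_sites(first_lines)
    second = read_sites(second_lines)
    return {
        "both": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


# Tiny made-up records, for illustration only
a = ["##fileformat=VCFv4.1", "#CHROM\tPOS\tID\tREF\tALT",
     "1\t100\t.\tA\tG", "1\t200\t.\tC\tT"]
b = ["#CHROM\tPOS\tID\tREF\tALT",
     "1\t200\t.\tC\tT", "2\t50\t.\tG\tA"]

result = compare_sites(a, b)
print(sorted(result["only_first"]))
```

In a real analysis you would pass open file handles instead of lists, and you may also want to compare the REF/ALT alleles, since the same position can carry different alleles in different files.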
Thank you for the answer. I know that by using pools of individuals we lose a lot of information, but as you pointed out, it was done for economic reasons. I tried to apply a very strict filter (Q>=100 and DP>=100), as we are interested in the very few SNPs that are fixed (private alleles) in each breed, for traceability purposes. So after the filter I used the vcf-isec command to find SNPs that are in A.vcf but not in B.vcf and C.vcf. Doing it this way, can I be sure that I reach my target?
Thank you in advance
Marco