We had individuals (F1s, from a cross between the reference-genome plant and a target plant) of a plant species sequenced by NGS. Variants (SNPs and indels) were called for each target plant from the NGS data of these F1 individuals. Our data are haplotype data: a phased haplotype was called for one target plant (one parent of the F1 individual).
For quality control, we used three metrics: (1) concordance (the ratio of reads supporting a predicted feature to total coverage); (2) coverage (how many reads supported the variant); (3) base quality (the base quality from the sequencing process).
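To make these three metrics concrete, here is a minimal sketch of how they could be computed for a single call. The class and field names (`VariantCall`, `supporting_reads`, `total_reads`, `base_quals`) are hypothetical and do not come from our pipeline, and summarizing base quality by the mean Phred score is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class VariantCall:
    """Hypothetical per-call record; field names are illustrative only."""
    supporting_reads: int   # reads supporting the predicted feature
    total_reads: int        # total reads covering the site
    base_quals: List[int]   # Phred base qualities of supporting bases (assumed)

    @property
    def concordance(self) -> float:
        # Ratio of supporting reads to total coverage; 0.0 if the site is uncovered.
        return self.supporting_reads / self.total_reads if self.total_reads else 0.0

    @property
    def mean_base_quality(self) -> float:
        # Mean Phred score; 0.0 if no qualities were recorded.
        return mean(self.base_quals) if self.base_quals else 0.0


call = VariantCall(supporting_reads=12, total_reads=25, base_quals=[30, 35, 38, 32])
print(call.concordance, call.mean_base_quality)
```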
Here, concordance may be the most important variable for quality control. The best variant calls, as judged by concordance, are those with values near 0.5. Clearly, very small values (<0.1) and very large values (>0.9) are not good. Coverage may also play an important role: for example, calls with a concordance of 0.5 but a coverage of 0.1 may not be good calls. Base quality may be the most intuitive quality-control variable: calls with higher base quality should be the better ones.
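One possible way to combine concordance and coverage, offered only as a sketch and not as an established method for this kind of data, is to put a confidence interval on the concordance. The example below uses the Wilson score interval for a binomial proportion, which assumes reads behave like independent trials. The function names `wilson_interval` and `passes_filter` are hypothetical, and the 0.1 and 0.9 bounds are simply the values mentioned above, not tuned cutoffs.

```python
from math import sqrt


def wilson_interval(supporting: int, total: int, z: float = 1.96):
    """Wilson score interval for the proportion supporting/total (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = supporting / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def passes_filter(supporting: int, total: int, low: float = 0.1, high: float = 0.9) -> bool:
    """Keep a call only if the whole interval lies strictly inside (low, high)."""
    lower, upper = wilson_interval(supporting, total)
    return low < lower and upper < high


# Same concordance (0.5), very different coverage: the interval narrows as reads accumulate.
for supporting, total in [(1, 2), (50, 100)]:
    lower, upper = wilson_interval(supporting, total)
    print(total, round(lower, 3), round(upper, 3), passes_filter(supporting, total))
```

With this rule a call with a concordance of 0.5 can still be rejected when it rests on very few reads, because its interval is too wide to exclude the extreme values.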
I want to find a good strategy for setting cutoffs on our variant calls, based on all three variables or on concordance and coverage alone. I would prefer a statistical approach.
Could you please give me any ideas or directions on this problem? Thanks in advance.
Do you want to call variants or haplotypes? Your definition of concordance is a bit confusing; in this context it usually means similarity to known calls.
I want to call both haplotypes and variants. I mean that I can get the variants and the haplotype at the same time.