21 months ago
Dardo
Hello, I urgently need to select 30 chickpea lines that maximize diversity in genes related to symbiosis with rhizobia. I have 531 lines and a filtered VCF file containing the variants called at each position of each chromosome relative to the reference genome. My question is: given the chromosomes and positions of those genes, how can I use the VCF file for the 531 lines to pick 30 lines that capture the variability in them?
First, annotate your VCF file using the Ensembl Variant Effect Predictor (VEP) or SnpEff with the GTF for the chickpea reference, which will attach gene annotations and mutation consequences to each variant. Second, subset the VCF to variants in your genes of interest (rhizobia symbiosis genes, which I assume you can pull from the literature) that modify the protein sequence. Finally, convert the genotypes into non-reference allele counts (0, 1, or 2, assuming diploid), normalize the counts by
(x-2*u)/sqrt(2*u*(1-u))
where x is the count and u is the allele frequency of the variant (so 2*u is the mean count), and use PCA to visualize the variation between lines.
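The last step (counts, normalization, PCA) can be sketched in Python with NumPy. This is only a minimal illustration, not a tested pipeline: it assumes the genotypes have already been extracted from the annotated, subsetted VCF as GT strings, that the lines are diploid, that counts are centered on their mean 2*u, and that missing calls can be set to the variant mean. The function names and the toy genotypes are made up for the example.

```python
import numpy as np

def gt_to_count(gt):
    """Turn a VCF GT string ('0/1', '1|1', './.') into a non-reference
    allele count; missing calls become NaN."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return np.nan
    return float(sum(a != "0" for a in alleles))

def normalize(counts):
    """counts: variants x lines matrix of 0/1/2 (NaN = missing).
    Returns (x - 2u) / sqrt(2u(1 - u)) per variant, where u is the
    non-reference allele frequency estimated from the data."""
    u = np.nanmean(counts, axis=1) / 2.0
    keep = (u > 0) & (u < 1)           # drop monomorphic variants
    counts, u = counts[keep], u[keep, None]
    z = (counts - 2 * u) / np.sqrt(2 * u * (1 - u))
    return np.nan_to_num(z)            # missing -> 0, i.e. the variant mean

def pca(z, n_components=2):
    """PCA via SVD; returns one row of coordinates per line."""
    x = z.T                            # lines as rows, variants as columns
    x = x - x.mean(axis=0)
    left, s, _ = np.linalg.svd(x, full_matrices=False)
    return left[:, :n_components] * s[:n_components]

# Toy data (hypothetical): 3 variants x 4 lines
gts = [
    ["0/0", "0/1", "1/1", "0/0"],
    ["1/1", "1|1", "0/1", "0/0"],
    ["0/0", "0/0", "0/0", "0/0"],      # monomorphic, will be dropped
]
counts = np.array([[gt_to_count(g) for g in row] for row in gts])
z = normalize(counts)
coords = pca(z)
print(coords)
```

With real data, `counts` would have one row per protein-altering variant in the symbiosis genes and 531 columns, and the rows of `coords` could be plotted to see how the lines spread out.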