Hello!
I am new to studying maize genomics and would like to map a dataset of mutations to gene names.
The dataset is from Genomes2Fields, where for each sample we get an identifier for the maize type (e.g. 2369/DK3IIH6) and whether or not there was a mutation at a specific location (e.g. S1_162464).
My question is: how do I find the name of the gene in which a mutation such as S1_162464 is located?
I have been looking through the PHG and rPHG for hours now and am still confused.
Any help would be greatly appreciated. Thank you in advance!
Ah I see, thank you pjb39!
pjb39, just to clarify: does S1_162464 in the PHG, for example, correspond to chromosome 1, position 162464 of B73 in jbrowse.maizegdb.org?
Specifically where the annotations are from Zm-B73-REFERENCE-NAM-5.0?
I am guessing this is the case considering the "Assembly coordinates map linearly to reference coordinates" figures from Bradbury et al., 2022 supplementary material, but wanted to make 100% sure. Thanks!
It isn't clear what you're asking. The PHG stores haplotypes, each of which is specified by a reference range ID that provides coordinates relative to the reference fasta for that range. The reference range ID refers to a data object that includes a chromosome, start and end position (1-based) for the haplotype sequence. This data is relative to the reference file used to create the PHG.
There are haplotypes for both the reference and non-reference genomes in the db.
For haplotypes created from assemblies (vs WGS data), there is an identifier in the haplotypes table which allows for determining the assembly genome, as well as the assembly contig, start and end position where this sequence can be found in the assembly fasta file.
If you are using a maize PHG created with Zm-B73-REFERENCE-NAM-5.0.fa as the reference, the sequence for the reference haplotypes should match Zm-B73-REFERENCE-NAM-5.0.fa at the specified chrom and start/end positions. The sequence for non-reference haplotypes may be loaded into a program such as IGV for comparison to the reference data at the same locations.
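As a rough illustration of that check, here is a minimal Python sketch that pulls a 1-based range out of a FASTA file so it can be compared with a haplotype sequence. The helper names (read_fasta, extract_range) and the toy sequence are made up for this example, and it assumes the end position is inclusive, which is worth verifying against your own PHG.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of {sequence_name: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_range(seqs, chrom, start, end):
    """Return bases start..end of `chrom`.

    Assumption: coordinates are 1-based and inclusive at both ends.
    """
    return seqs[chrom][start - 1:end]


# Toy FASTA, invented for illustration (not real maize sequence).
toy_fasta = [">chr1 toy record", "ACGTACGT", "TTGGCCAA"]
seqs = read_fasta(toy_fasta)
print(extract_range(seqs, "chr1", 3, 6))  # GTAC
```

For the real reference you would pass an open file handle to read_fasta instead of the toy list. This loads every chromosome into memory, so an indexed FASTA reader is probably the better choice for a genome the size of maize.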
Please clarify if this does not answer your question.
Apologies lcj34 for the lack of clarity and thank you for your answer.
I am trying to perform genotype-phenotype prediction using the genomes2fields data preprocessed by Lopez-Cruz et al 2023 (https://doi.org/10.1038/s41467-023-42687-4).
They write: "DNA genotypes were derived from a common set of 437,214 SNPs available from the Practical Haplotype Graph (PHG) platform45..." I am not sure which reference genome was used. I have now emailed the corresponding authors to confirm this.
OK, but let's assume the PHG was created with Zm-B73-REFERENCE-NAM-5.0.fa as the reference.
In the public dataset of the above paper, I don't see all of the things you mention, such as the chrom start/end positions. The genotype data consists of around 100,000 columns with names such as S1_<Numbers>, S2_<Numbers>, ... , S10_<Numbers>. I understand that these are haplotypes, but what I'd like to know is how to figure out where they are in the reference genome. Is it simply:
S1 = Chromosome 1
<Numbers> = location on Chromosome 1 of the reference genome?
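If that reading is right (it is still unconfirmed at this point), splitting a column name into its two parts could be sketched in Python as below. The function name parse_marker and the S<chromosome>_<position> pattern are my own assumptions, not something documented by the PHG or the paper.

```python
import re

# Assumption: column names look like "S<chromosome>_<position>", e.g. "S1_162464".
MARKER_PATTERN = re.compile(r"S(\d+)_(\d+)")


def parse_marker(name):
    """Split a marker name into (chromosome, position) as integers."""
    match = MARKER_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Unexpected marker name: {name!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_marker("S1_162464"))     # (1, 162464)
print(parse_marker("S5_163897270"))  # (5, 163897270)
```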
So let's say S5_163897270 comes out as being particularly predictive for yield. I would like to determine if that haplotype is part of a specific gene in the reference genome.
Can I take the Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3 and look for a gene that has a start position less than 163897270 and end position greater than 163897270?
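To make that concrete, this is roughly the lookup I have in mind, as a Python sketch. It relies only on the standard GFF3 column layout (seqid, source, type, start, end, score, strand, phase, attributes) with 1-based inclusive coordinates. The function name genes_at and the toy records are invented for illustration, and the "chr5" sequence name is an assumption that should be checked against the actual file.

```python
def genes_at(gff3_lines, chrom, pos):
    """Return IDs of 'gene' features on `chrom` whose interval contains `pos`.

    GFF3 start and end are 1-based and inclusive, so a position equal to
    either boundary counts as inside the gene.
    """
    hits = []
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        seqid, ftype, start, end, attrs = cols[0], cols[2], cols[3], cols[4], cols[8]
        if seqid != chrom or ftype != "gene":
            continue
        if int(start) <= pos <= int(end):
            fields = dict(
                item.split("=", 1) for item in attrs.split(";") if "=" in item
            )
            # Fall back to the raw attribute string if there is no ID tag.
            hits.append(fields.get("ID", attrs))
    return hits


# Toy GFF3 records, invented for illustration (not real maize annotations).
toy_gff3 = [
    "##gff-version 3",
    "chr5\t.\tgene\t163897000\t163899000\t.\t+\t.\tID=toy_gene_1",
    "chr5\t.\tmRNA\t163897000\t163899000\t.\t+\t.\tID=toy_mrna_1;Parent=toy_gene_1",
    "chr5\t.\tgene\t163900000\t163901000\t.\t-\t.\tID=toy_gene_2",
]
print(genes_at(toy_gff3, "chr5", 163897270))  # ['toy_gene_1']
```

With the real annotation, the toy list would be replaced by an open handle on Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3. An empty result would mean the position is intergenic, or that the sequence name does not match the one used in the file.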
Perhaps this boils down to how the reference range ID (S1_<Numbers>?) maps to the reference genome.
Hi dylan - I think the confusion is related to what is PHG output and what is post-processing of PHG data. The output the authors derived from the PHG was most likely a multi-sample VCF file. As you can see from the paragraph starting with the sentence you reference above, the authors ran various BCF/VCF tools on this data to obtain/create their genotypes.
The S1_<Numbers> data would be a result of post-processing the PHG VCF - it did not come from the PHG. Our reference range IDs are merely integers that reference a specific row in another database table. While the name most likely refers to a position on a specific chromosome, you should check with the authors for verification.
To determine whether a specific chromosome/position falls within a gene, you could use a tool such as the Broad Institute's IGV. For this tool you would load the reference genome along with the gene annotation file (GFF) for that reference, then navigate to your chrom/position and look at the annotation tracks to see if it falls within a gene region.