When a polyploid crop plant is sequenced, how are data assembly and analysis done for the haploid genome of that crop?
This could take an entire book to explain. I think currently the common approach is to identify and sequence the ancestral genomes, then tackle the polyploid genome, or assemble the genomes in parallel. This has generated good assemblies for canola, cotton and wheat.
Once a genome or reference sequence is available for a species, you can take the fragment or contig of interest (if the sequences are publicly available) and assemble your own data against it: the .ab1 chromatogram files if the sequence comes from Sanger sequencing, or the FASTQ files if it comes from next-generation sequencing. Assembly software such as CodonCode Aligner (https://www.codoncode.com/) or Geneious (http://assets.geneious.com/manual/8.0/GeneiousManualse39.html) will do this and give you the assembled output. Note that for diploid and polyploid species the databases hold a haploid-level sequence, so you need to find the variant form (allele) in your own sequence.
A polyploid such as wheat (an allohexaploid) behaves like a diploid species. Wheat has 7 homeologous groups of chromosomes spread over three subgenomes, A, B and D: 1A-7A, 1B-7B and 1D-7D. That gives 3 (subgenomes) × 7 (chromosomes) = 21 chromosomes at the haploid level, and 21 × 2 = 42 chromosomes at the diploid level. If a gene has a second copy within the same chromosome, the copies are paralogous; if the copies sit on different homeologous chromosomes, they are homeologous; and the corresponding copy in another species is orthologous.
Check these servers as well: https://urgi.versailles.inra.fr/gb2/gbrowse/wheat_phys_pub/ https://urgi.versailles.inra.fr/blast_iwgsc/?dbgroup=wheat_iwgsc_refseq_v1_chromosomes&program=blastn
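To make the wheat chromosome arithmetic concrete, here is a minimal Python sketch. The function and variable names are my own, chosen only for illustration; it simply enumerates the 1A-7D chromosome names and counts them at the haploid and diploid level.

```python
# Wheat example from the text: three subgenomes (A, B, D), each with
# seven chromosomes, which gives seven homeologous groups.
SUBGENOMES = ["A", "B", "D"]
CHROMOSOMES_PER_SUBGENOME = 7


def homeologous_groups(subgenomes=SUBGENOMES, n=CHROMOSOMES_PER_SUBGENOME):
    """Return {group number: [chromosome names]}, e.g. 1 -> ['1A', '1B', '1D']."""
    return {g: [f"{g}{s}" for s in subgenomes] for g in range(1, n + 1)}


groups = homeologous_groups()

# Haploid level: one copy of every chromosome (3 subgenomes x 7 chromosomes).
haploid_count = sum(len(chroms) for chroms in groups.values())

# Diploid level: two copies of every chromosome.
diploid_count = 2 * haploid_count

print(groups[1])                     # the three homeologs of group 1
print(haploid_count, diploid_count)  # 21 42
```

Each value in `groups` is one homeologous group, so a gene found on 1A would have its homeologs looked up on 1B and 1D.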
Working with diploids is actually slightly easier than working with polyploids, but polyploids can ultimately behave as diploids too, as in the wheat example I gave above. In the lab, people generally produce doubled haploids from a haploid set of chromosomes. See, for instance: https://www.ncbi.nlm.nih.gov/pubmed/24232908.
But de-duplication of a genome is not that easy. You could consider gene-knockout-style studies. If you know exactly which two haploid sets you have (haploid set 1 + haploid set 2), you can eliminate one set computationally. But you might not know which haploid sets were sequenced. You would know if you sequenced a single set from haploid material, such as anther-derived cells in plants. For a fungus, you would have to use spores or hyphae after meiosis, before they enter the diploid phase of the life cycle. Unless you know whether a particular gene is heterozygous or homozygous, it is not possible to de-duplicate safely, since you may lose information or the allele of interest. One thing you can try: if your sequence (.ab1 file) shows double peaks in your gene of interest, there may be an alternative form of the gene, i.e. an allele, and the double peak marks a heterozygous SNP position. Let's assume:
atg ggg tgg cgg gtt ttg tga   allele 1 (from haploid set 1)
 M   G   W   R   V   L   -
atg ggg tgg cgg gtt gtg tga   allele 2 (from haploid set 2)
 M   G   W   R   V   V   -
This guide to chromatogram interpretation is useful: http://www.biol.unt.edu/~jajohnson/Chromatogram_Interpretation
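The two example alleles can also be compared programmatically. The Python sketch below is only an illustration: the function names are mine, and the codon table is deliberately limited to the codons in the example, so real data would need the full genetic code. It translates both alleles, lists the positions where they differ, and writes a consensus using the IUPAC ambiguity code, which is roughly what a double peak in a chromatogram corresponds to.

```python
# Codon table limited to the codons that occur in the two example alleles
# (an assumption for brevity; real data needs the full genetic code).
CODON_TABLE = {
    "atg": "M", "ggg": "G", "tgg": "W", "cgg": "R",
    "gtt": "V", "ttg": "L", "gtg": "V", "tga": "-",
}

# IUPAC ambiguity codes for a mixture of two bases at one position.
IUPAC = {
    frozenset("ag"): "r", frozenset("ct"): "y", frozenset("cg"): "s",
    frozenset("at"): "w", frozenset("gt"): "k", frozenset("ac"): "m",
}


def translate(seq):
    """Translate a coding sequence codon by codon ('-' marks a stop)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def find_snps(a, b):
    """Return (1-based position, base in a, base in b) for each mismatch."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def consensus(a, b):
    """Merge two equal-length alleles, using IUPAC codes where they differ."""
    return "".join(x if x == y else IUPAC[frozenset((x, y))] for x, y in zip(a, b))


allele1 = "atg ggg tgg cgg gtt ttg tga".replace(" ", "")  # haploid set 1
allele2 = "atg ggg tgg cgg gtt gtg tga".replace(" ", "")  # haploid set 2

print(translate(allele1), translate(allele2))  # MGWRVL- MGWRVV-
print(find_snps(allele1, allele2))             # [(16, 't', 'g')]
print(consensus(allele1, allele2))             # position 16 becomes 'k' (g or t)
```

Here the single SNP at nucleotide 16 changes codon 6 from ttg (L) to gtg (V), so collapsing the two haploid sets into one would silently drop one of the two protein variants.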
So you need to know which haploid set or sets you have, and which allele is dominant or favorable. You can then state the allele of interest for your study directly instead of attempting computational de-duplication.
*I hope it is appropriate for me to comment...
I am working with a non-model fungus and have the same question. This is the first time I have dealt with whole-genome assembly and annotation of a diploid organism. I compared assemblies from SPAdes and dipSPAdes (the diploid version of the assembler) and found that the diploid assemblies were better. I went forward with the diploid assemblies for gene prediction, but now I am wondering whether it is best to de-duplicate the genome before gene prediction, or whether there is a way to de-duplicate afterwards. Any ideas?