Hello everyone,
I'm a student in the area of genomics.
I have two genome assemblies built from long reads (both genomes are haploid). One is the reference of the organism (K. phaffii, a yeast) and represents the wild type. The other (the query) is an assembly of a K. phaffii strain that was derived from the wild type (the reference) and contains a few genomic modifications. I want to use this data to create a ground truth set of structural variants (SVs), i.e. a file containing the "true" structural variants that are present in the query.
I tried this by running two SV callers that take assemblies as input, SVIM-asm and Assemblytics. I also used NucDiff and MUMmer's dnadiff to get information about the differences between the two assemblies. My idea was that the consensus of these four tools would give a confident estimate of the "real" SVs in the query.
However, these four tools disagree heavily, and the consensus between them is very limited. I then tried to visualize the alignment between the two assemblies with tools such as IGV and D-Genies, but I was unable to identify SVs manually from that comparison.
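To make "consensus" concrete, here is a minimal sketch of how calls from several tools could be merged. It assumes each tool's output has already been converted into a common record (chromosome, position, size, type, tool name); that record layout, the function names and the thresholds (500 bp breakpoint distance, 0.7 size ratio) are illustrative assumptions, not the native output or defaults of any of the four tools.

```python
from collections import namedtuple

# Hypothetical common record; each tool's output would need its own parser.
SV = namedtuple("SV", ["chrom", "pos", "size", "svtype", "tool"])


def same_event(a, b, max_dist=500, min_size_ratio=0.7):
    """Treat two calls as the same SV if type and chromosome match,
    the breakpoints are close, and the sizes are similar (sizes must be > 0)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    return min(a.size, b.size) / max(a.size, b.size) >= min_size_ratio


def consensus(calls, min_tools=2, max_dist=500, min_size_ratio=0.7):
    """Greedily cluster calls and keep clusters supported by >= min_tools tools."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.pos)):
        for cluster in clusters:
            if any(same_event(call, other, max_dist, min_size_ratio)
                   for other in cluster):
                cluster.append(call)
                break
        else:
            # No existing cluster matched, so this call starts a new one.
            clusters.append([call])
    return [c for c in clusters if len({x.tool for x in c}) >= min_tools]
```

Matching on breakpoint distance and size ratio rather than on exact coordinates matters here, because different tools often report the same event with shifted breakpoints. Loosening or tightening the two thresholds changes how much "consensus" there appears to be.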
Therefore my question: how would you approach creating the ground truth in my situation, given that you only have these two assemblies (reference and query) and cannot perform additional laboratory experiments?
I would be very thankful for recommendations,
Kind regards,
Thomas
Thank you very much for your response!
What I have tried so far is calling the SVs from the assembly alignment and validating the individual calls with a BLAST search against both the reference and the query assembly.
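As an illustration of what such a check could look like for a putative insertion (a rough sketch, not the exact procedure used): the inserted sequence is searched against both assemblies, and the call counts as supported if it has a near full-length hit in the query but not in the reference. The sketch assumes the BLAST results are in the default tabular format (-outfmt 6); the function names and the identity and coverage thresholds are made up for the example.

```python
import csv
import io

# Column order of BLAST's default tabular output (-outfmt 6).
BLAST6_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_query_coverage(blast6_text, seq_id, seq_len, min_identity=95.0):
    """Largest fraction of the searched sequence covered by a single hit."""
    best = 0.0
    for row in csv.reader(io.StringIO(blast6_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST6_FIELDS, row))
        if hit["qseqid"] != seq_id or float(hit["pident"]) < min_identity:
            continue
        covered = abs(int(hit["qend"]) - int(hit["qstart"])) + 1
        best = max(best, covered / seq_len)
    return best


def insertion_supported(ref_hits, qry_hits, seq_id, seq_len, min_cov=0.9):
    """Supported if the sequence is (nearly) complete in the query assembly
    but has no comparable hit in the reference (hypothetical criterion)."""
    in_query = best_query_coverage(qry_hits, seq_id, seq_len) >= min_cov
    in_ref = best_query_coverage(ref_hits, seq_id, seq_len) >= min_cov
    return in_query and not in_ref
```

One caveat with this criterion: a sequence that was duplicated or moved from elsewhere in the genome will also hit the reference at full length, so such events would need a check on the hit location rather than on presence alone.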
Best,
Thomas