How to create a structural-variant ground truth from the alignment of two long-read genome assemblies?
15 months ago
Thomas ▴ 40

Hello everyone,

I'm a student in the area of genomics.

I have two genome assemblies from long reads (from haploid genomes). One is the reference of the organism (K. phaffii, a yeast), which represents the wild type. The other (the query) is an assembly of a K. phaffii strain that was derived from the wild type (the reference) and contains a few genomic modifications. I want to use this data to create a ground-truth set of structural variants (SVs), that is, a file containing the "true" structural variants present in the query.

I tried this by running two SV callers that take assemblies as input, SVIM-asm and Assemblytics. I also used NucDiff and MUMmer's dnadiff to get information about the differences between the two assemblies. My idea was that the consensus of these four tools would give a confident estimate of the "real" structural variants (SVs) in the query.
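The consensus idea can be sketched roughly as follows. This is only an illustration under stated assumptions: each tool's output has already been converted to simple records (contig, position, length, type, caller), and two calls count as the same event if they share contig and type, lie within a distance tolerance, and have similar lengths. The names `SV`, `same_event` and `consensus`, and the thresholds, are hypothetical and not taken from any of the four tools.

```python
from collections import namedtuple

# Hypothetical minimal record; real tool outputs (VCF, BED, GFF) need converting first.
SV = namedtuple("SV", "chrom pos svlen svtype caller")


def same_event(a, b, max_dist=200, min_size_ratio=0.7):
    """Heuristic match: same contig and type, nearby position, similar length."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    len_a, len_b = abs(a.svlen), abs(b.svlen)
    # max(..., 1) avoids division by zero for zero-length records.
    return min(len_a, len_b) / max(len_a, len_b, 1) >= min_size_ratio


def consensus(calls, min_callers=3, **match_kwargs):
    """Greedily cluster calls and keep clusters supported by enough distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.pos)):
        for cluster in clusters:
            # Compare against the first member of each cluster (its representative).
            if same_event(cluster[0], call, **match_kwargs):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [c for c in clusters if len({m.caller for m in c}) >= min_callers]
```

Loosening `max_dist` or `min_size_ratio` shows how much of the disagreement comes from breakpoint placement rather than from genuinely different calls.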

However, these four tools heavily disagree and the consensus between them is very limited. I then tried to visualize the alignment between these two assemblies with tools such as IGV and D-Genies, but I was unable to manually find SVs from that comparison.

Therefore my question: how would you approach creating the ground truth in my situation, given that you have only these two assemblies (reference and query) and cannot perform additional laboratory experiments?

I would be very thankful for recommendations,

Kind regards,

Thomas

yeast assembly structural-variation SV-callers • 1.0k views
15 months ago
Christophe ▴ 20

Hi,

D-Genies uses minimap2 to align the two genomes, and minimap2 chains local alignments to produce a global one. If the SVs are small or medium-sized insertions or deletions, they may be absorbed into the chains and so not appear as breaks in the dot plot. You can change this behavior with the -g parameter, which limits how far a chain can be extended across a gap: https://lh3.github.io/minimap2/minimap2.html

Another solution is to align the long reads of your second assembly, if you have access to them, to the reference and call SVs from that alignment. There are several SV callers, including SVIM and pbsv, which we have used successfully in some of our projects. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02551-4
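As a rough outline, the read-based route could look like the sketch below. It only builds the command lines and does not run them. The file names, the `read_based_sv_commands` helper and the default `map-ont` preset are assumptions for illustration; pick the minimap2 preset that matches your sequencing platform and check each tool's help for the options that suit your data.

```python
def read_based_sv_commands(reference, reads, workdir, preset="map-ont"):
    """Build (but do not run) commands: align reads, sort/index, call SVs with SVIM."""
    sam = f"{workdir}/reads.sam"
    bam = f"{workdir}/reads.sorted.bam"
    return [
        # Align long reads to the reference and write SAM output.
        ["minimap2", "-a", "-x", preset, "-o", sam, reference, reads],
        # Coordinate-sort and index the alignments.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        # Call SVs from the read alignments.
        ["svim", "alignment", workdir, bam, reference],
    ]


if __name__ == "__main__":
    for cmd in read_based_sv_commands("reference.fa", "query_reads.fastq", "sv_out"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the working directory exists.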

Cheers,

Christophe


Thank you very much for your response!

What I have tried so far is calling the SVs from the assembly alignment and validating each call with a BLAST search against both the reference and the query.
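One way such a check could be set up for a deletion (a sketch of the general idea, with a hypothetical helper name and flank size) is to join the reference sequence on either side of the call into a single probe. If the deletion is real, the probe should match the query as one contiguous hit, while in the reference its two halves are separated by the deleted segment.

```python
def deletion_junction_probe(ref_seq, del_start, del_end, flank=200):
    """Join the reference flanks around a deletion (0-based, end-exclusive coordinates)."""
    left = ref_seq[max(0, del_start - flank):del_start]
    right = ref_seq[del_end:del_end + flank]
    return left + right


if __name__ == "__main__":
    # Toy sequences only; real probes would be written to FASTA and searched with BLAST.
    reference = "A" * 10 + "C" * 5 + "G" * 10
    query = "A" * 10 + "G" * 10  # same as the reference with the C block deleted
    probe = deletion_junction_probe(reference, 10, 15, flank=4)
    print(probe, probe in query, probe in reference)
```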

Best,

Thomas
