Question

Comparing reference genome assemblies of two plant varieties

1

2.3 years ago

Chiara ▴ 10

Hi everybody,

I plan to compare the reference genome assemblies of two plant varieties (~1.1 Gb genome size) to estimate nucleotide sequence divergence (SNPs per kb) and to identify large structural variations (SVs) between them.

Both assemblies were downloaded from the Phytozome database in unmasked format (although they still contain Ns). I initially used MUMmer4 (nucmer) for sequence alignment and SNP calling. Then I tried SyRI (with plotsr) in a Python environment to find SVs, but it failed to detect structural variations because of "missing information" within the assemblies.

Could the Ns be causing this issue? If so, is there an alternative way to conduct this analysis?

Thanks, Chiara

Plant SNPs comparisons genome SVs • 1.3k views

updated 2.2 years ago by colindaven 7.0k • written 2.3 years ago by Chiara ▴ 10

1

The Ns in the sequence should not be something to worry about. They are quite common and denote gaps in the assembly: regions that we know should be there but whose sequence content could not be determined. (Sequences containing such Ns are also called scaffolds.)
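To see where those gaps actually sit, a minimal sketch like the one below could list the runs of Ns in each sequence. It is plain Python with no dependencies; the function names `read_fasta` and `n_runs` and the toy record are made up for illustration and are not part of any tool mentioned in this thread.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # keep only the first word of the header as the sequence name
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_runs(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals of consecutive N/n."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


# Toy record standing in for a real assembly file (hypothetical data).
example = [">chr1 toy scaffold", "ACGTNNNNNACGT", "NNACGT"]
for name, seq in read_fasta(example):
    print(name, n_runs(seq))  # prints: chr1 [(4, 9), (13, 15)]
```

For a real assembly you would pass an open file handle instead of the `example` list, and probably raise `min_len` so that only scaffold-level gaps are reported.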

2.3 years ago by lieven.sterck 15k

1

I don't think SyRI is the best option for you; it was designed to compare genome versions of the same species rather than different strains or varieties. Besides, it has many shortcomings with draft assemblies (you need chromosome-scale assemblies) and makes many assumptions that only hold when you are comparing the same genome. In my opinion nucmer is a good idea, and you can complement it with sniffles2 (if you have access to the sequencing reads).
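As a rough sketch of the nucmer route, the Python below only builds the MUMmer4 command lines and computes a SNPs-per-kb figure; it does not run anything. The file names, the `out` prefix, the helper names and the example counts are all placeholders, and the aligned length is assumed to come from your own summary of the filtered alignments (for example from show-coords output).

```python
import shlex
from typing import List


def nucmer_snp_commands(ref: str, qry: str, prefix: str = "out") -> List[List[str]]:
    """Build MUMmer4 command lines for alignment, filtering and SNP listing."""
    return [
        # whole-genome alignment; writes <prefix>.delta
        ["nucmer", f"--prefix={prefix}", ref, qry],
        # keep 1-to-1 alignments; prints to stdout, so redirect the output
        # to <prefix>.filtered.delta when running it
        ["delta-filter", "-1", f"{prefix}.delta"],
        # tab-delimited SNP list (-I drops indels, -C drops ambiguous
        # mappings); also prints to stdout
        ["show-snps", "-C", "-I", "-l", "-r", "-T", f"{prefix}.filtered.delta"],
    ]


def snps_per_kb(n_snps: int, aligned_bp: int) -> float:
    """SNP density per kilobase of aligned sequence."""
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    return 1000.0 * n_snps / aligned_bp


# Placeholder file names, just to show the resulting command lines.
for cmd in nucmer_snp_commands("varietyA.fa", "varietyB.fa"):
    print(shlex.join(cmd))

# Made-up numbers: 5,500 SNPs over 1.1 Mb of aligned sequence.
print(snps_per_kb(5500, 1_100_000))  # prints: 5.0
```

Dividing by the aligned length rather than the full genome size keeps the N gaps and unaligned regions from diluting the estimate.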

2.3 years ago by Buffo ★ 2.4k

0

Hi both,

I appreciate your advice. I have one more question: since these are draft assemblies, should I use a read simulator before starting the comparison (alignment for SNP calling, structural variation detection, etc.)? When is a read simulator used?

Thank you,

Chiara

2.2 years ago by Chiara ▴ 10

score 0 · Answer 1 · 2022-09-01

There are other options.

One is to compare them chromosome by chromosome only (assuming the chromosomes are homeologous) using PGGB. This will also give you a VCF of SVs. https://github.com/pangenome/pggb/issues

Another is to use JBrowse2 and its synteny tutorial to map and browse the differences between the two. This relies on the same assumption as above to be usable, I guess. https://jbrowse.org/jb2/docs/superquickstart_web/#load-a-synteny-track
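For the chromosome-by-chromosome route, the two assemblies first have to be split and paired up. A minimal sketch of that preparation step follows, assuming the chromosomes carry identical names in both assemblies and that names are unique within each assembly. Sequences are renamed in the PanSN style (`sample#haplotype#contig`) commonly used for pangenome inputs; the sample names `varA` and `varB`, the helper names and the toy sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def pansn_name(sample: str, contig: str, haplotype: int = 1) -> str:
    """Build a PanSN-style name: sample#haplotype#contig."""
    return f"{sample}#{haplotype}#{contig}"


def group_by_chromosome(
    assemblies: Dict[str, Iterable[Record]]
) -> Dict[str, List[Record]]:
    """Group renamed records by chromosome name across assemblies."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for sample, records in assemblies.items():
        for contig, seq in records:
            groups[contig].append((pansn_name(sample, contig), seq))
    # keep only chromosomes found in every assembly
    # (assumes names are unique within each assembly)
    n = len(assemblies)
    return {c: recs for c, recs in groups.items() if len(recs) == n}


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Format records as FASTA text with fixed-width sequence lines."""
    out = []
    for name, seq in records:
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy input standing in for the two downloaded assemblies.
assemblies = {
    "varA": [("Chr01", "ACGT"), ("Chr02", "GGCC")],
    "varB": [("Chr01", "ACGA"), ("scaffold_9", "TTTT")],
}
grouped = group_by_chromosome(assemblies)
for chrom, records in grouped.items():
    # one combined FASTA per shared chromosome
    print(f"# {chrom}.fa")
    print(to_fasta(records), end="")
```

Each per-chromosome FASTA would then be compressed and indexed before being given to PGGB; check the PGGB documentation for the exact invocation. Unplaced scaffolds that exist in only one assembly are dropped here, which is the price of the chromosome-by-chromosome assumption.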