Hi everybody,
I plan to compare the reference genome assemblies of two plant varieties (~1.1 Gb genome size) to estimate nucleotide sequence divergence (SNPs per kb) and identify large structural variations (SVs) between them.
Both assemblies were downloaded from the Phytozome database in unmasked format (although they still contain Ns). I initially used mummer4 (nucmer) for the sequence alignment and SNP calling. Then I tried to use syri (with plotsr) in a Python environment to find SVs, but it failed to detect structural variations due to "missing information" within the assemblies.
Could the Ns be causing this issue? If so, is there an alternative way to conduct this analysis?
Thanks, Chiara
The Ns in the sequence should not be something to worry about. They are quite common and denote gaps in the assembly: regions that we know should be there but whose sequence content could not be determined. (Sequences that contain such runs of Ns are also called scaffolds.)
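If you want to see how much of each assembly is made up of such gaps, a minimal Python sketch like the one below can list the runs of Ns per sequence. The function names (read_fasta, n_runs, gap_summary) are made up for this illustration and are not part of any existing tool. It assumes a plain, uncompressed FASTA file and keeps every sequence in memory, so a 1.1 Gb genome will need a few GB of RAM.

```python
import re
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_name: sequence} (hypothetical helper)."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def n_runs(seq: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive (start, end) intervals of N/n runs."""
    return [
        m.span()
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


def gap_summary(seqs: Dict[str, str]) -> Dict[str, dict]:
    """Summarise length, number of gaps and N content for each sequence."""
    summary = {}
    for name, seq in seqs.items():
        runs = n_runs(seq)
        n_bases = sum(end - start for start, end in runs)
        summary[name] = {
            "length": len(seq),
            "gaps": len(runs),
            "n_bases": n_bases,
            "n_fraction": n_bases / len(seq) if seq else 0.0,
        }
    return summary
```

To run it on a real file, something like `with open("assembly.fa") as fh: print(gap_summary(read_fasta(fh)))` should work, where "assembly.fa" is a placeholder for your own file name.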
I don't think syri is the best option for you: it was designed to compare genome versions of the same species rather than different strains or varieties. Besides, it has many deficiencies with draft assemblies (you need chromosome-scale assemblies) and makes many assumptions on the basis that you are comparing the same genome. In my opinion nucmer is a good idea, and you can complement it with sniffles2 (if you have access to the sequencing reads).
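For the SNPs-per-kb part, here is a rough Python sketch of how the SNP table from nucmer could be turned into a density. It assumes a tab-delimited table (as I understand `show-snps -T` to produce) where column 1 is the reference position, columns 2 and 3 are the reference and query bases (with "." marking an indel), and the last two columns are the reference and query sequence names. Please check this against your own output, since the columns change with the options used. The function names and the 1 kb window size are only illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple


def snp_counts_per_window(lines: Iterable[str], window: int = 1000) -> Counter:
    """Count substitutions per (reference_name, window_index).

    ASSUMED layout (verify on your own file): tab-delimited, col 1 = reference
    position (1-based), col 2 = reference base, col 3 = query base, and the
    second-to-last column = reference sequence name.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines and anything that does not look like a data row.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        ref_pos = int(fields[0])
        ref_base, qry_base = fields[1], fields[2]
        # "." in either base column marks an indel, not a substitution.
        if ref_base == "." or qry_base == ".":
            continue
        ref_name = fields[-2]
        counts[(ref_name, (ref_pos - 1) // window)] += 1
    return counts


def snps_per_kb(n_snps: int, aligned_bp: int) -> float:
    """Genome-wide SNP density, normalised by the aligned length in bp."""
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    return n_snps * 1000 / aligned_bp
```

Note that the density should be divided by the number of aligned bases rather than the full genome size, otherwise gaps and unaligned regions will make the two varieties look more similar than they are.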
Hi both,
I appreciate your advice. I have one more question: since these are draft assemblies, should I use a read simulator before starting the similarity analysis (alignment for SNP calling, structural variation, etc.)? When is a read simulator used?
Thank you,
Chiara