Hello Biostars,
I am doing a whole-genome alignment with NUCmer (a program in the MUMmer package), and I want to use this alignment to separate core and accessory chromosomes. I filtered the delta file from the NUCmer alignment with the -r and -g options and then generated a coordinate file. This coordinate file looks like this:
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
===============================================================================================================================
3 1062 | 2882 3943 | 1060 1062 | 87.29 | 47164 22944 | 2.25 4.63 | sca_100_unmapped scaffold_479
3 1046 | 2196 3240 | 1044 1045 | 88.52 | 47164 201231 | 2.21 0.52 | sca_100_unmapped scaffold_68
2091 2303 | 24338 24550 | 213 213 | 88.02 | 47164 27763 | 0.45 0.77 | sca_100_unmapped scaffold_442
9756 11454 | 108083 106395 | 1699 1689 | 93.47 | 47164 181231 | 3.60 0.93 | sca_100_unmapped scaffold_81
13817 15198 | 54353 55731 | 1382 1379 | 87.49 | 47164 146674 | 2.93 0.94 | sca_100_unmapped scaffold_110
46400 46664 | 7992 7731 | 265 262 | 84.27 | 47164 30552 | 0.56 0.86 | sca_100_unmapped scaffold_418
2236 3032 | 64822 65618 | 797 797 | 83.71 | 46409 72978 | 1.72 1.09 | sca_101_unmapped scaffold_232
2239 3578 | 21278 19939 | 1340 1340 | 79.63 | 46409 28656 | 2.89 4.68 | sca_101_unmapped scaffold_438
11309 11945 | 41233 40596 | 637 638 | 85.76 | 46409 48260 | 1.37 1.32 | sca_101_unmapped scaffold_316
12138 12918 | 40117 39337 | 781 781 | 86.04 | 46409 48260 | 1.68 1.62 | sca_101_unmapped scaffold_316
12840 16991 | 198620 202766 | 4152 4147 | 85.95 | 46409 284610 | 8.95 1.46 | sca_101_unmapped scaffold_48
24138 24287 | 48814 48963 | 150 150 | 96.67 | 46409 178768 | 0.32 0.08 | sca_101_unmapped scaffold_84
As you can see from the table, a single scaffold in my reference genome matches many scaffolds in the query genome. Another problem is that both my reference and query genomes contain a large number of scaffolds. I am having trouble working out how to filter these results further and separate the core and accessory regions of my query genome. I have been stuck on this step for quite some time and could not find any resource that explains what to do. I would really appreciate any suggestions.
Thank you, Ambika
Not directly answering your question, but I suggest trying Anvi'o. I don't use the program, but I know someone who dealt with the same issue as you and Anvi'o was his solution. It has a nice graphical interface and many tutorials are available.
If this is the complete table, there isn't much overlap between the two genomes. Aside from a single ~4 kb match, every alignment is shorter than 2 kb. It may be more informative to translate/annotate the genomes and compare them at the protein level.
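To put a number on "how much overlap", one option is to summarise the coordinate file per query scaffold. Below is a minimal Python sketch, assuming the file has exactly the seven `|`-separated fields shown in the question. The function names (`parse_coords`, `merged_length`, `query_coverage`) and the `min_identity` / `min_len` filters are my own illustrative choices, not part of MUMmer, and any cutoff used to call a scaffold "accessory" would be your own judgment call.

```python
from collections import defaultdict


def parse_coords(lines):
    """Yield one dict per alignment row, skipping header and separator lines."""
    for line in lines:
        parts = [field.split() for field in line.split("|")]
        # Data rows have 7 '|'-separated fields and start with a number.
        if len(parts) != 7 or not parts[0] or not parts[0][0].isdigit():
            continue
        # Reverse-strand hits have S2 > E2, so sort the query coordinates.
        q_start, q_end = sorted(int(x) for x in parts[1])
        yield {
            "q_start": q_start,
            "q_end": q_end,
            "identity": float(parts[3][0]),
            "q_len": int(parts[4][1]),
            "ref": parts[6][0],
            "query": parts[6][1],
        }


def merged_length(intervals):
    """Total bases covered by inclusive intervals, counting overlaps once."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def query_coverage(rows, min_identity=0.0, min_len=0):
    """Fraction of each query scaffold covered by alignments passing the filters."""
    intervals = defaultdict(list)
    lengths = {}
    for row in rows:
        lengths[row["query"]] = row["q_len"]
        aligned = row["q_end"] - row["q_start"] + 1
        if row["identity"] >= min_identity and aligned >= min_len:
            intervals[row["query"]].append((row["q_start"], row["q_end"]))
    return {q: merged_length(intervals[q]) / lengths[q] for q in lengths}


# Two rows copied from the table in the question, plus the header line.
example = [
    "[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]",
    "11309 11945 | 41233 40596 | 637 638 | 85.76 | 46409 48260 | 1.37 1.32 | sca_101_unmapped scaffold_316",
    "12138 12918 | 40117 39337 | 781 781 | 86.04 | 46409 48260 | 1.68 1.62 | sca_101_unmapped scaffold_316",
]
coverage = query_coverage(parse_coords(example))
for name, frac in sorted(coverage.items()):
    print(f"{name}\t{frac:.4f}")
```

For the two example rows, scaffold_316 comes out at about 2.9% covered, which matches the sum of its [COV Q] values. To run it on a real file, pass an open file handle to `parse_coords` instead of the list. Keep in mind that query scaffolds with no alignment at all never appear in the coordinate file, so they have to be added from the query FASTA with a coverage of zero.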
It is not the complete table; I also have some overlapping regions longer than 20 kb. I do have the gene annotation file. Do you think comparing the protein sequences will help in distinguishing the accessory regions of the genome? And thank you for suggesting that program; I will look into it.