I have an output file from MUMmer that shows multiple reference/query pairs with 100% identity. My reference is the gene DNMT1 (species A), and my query is a set of scaffolds from a closely related species (species B) that has not yet been mapped to the chromosome level. For other genes, only one scaffold (query) had 100% identity, and a multiple alignment confirmed that only a few SNPs are present across the whole gene. For certain genes, however, MUMmer returns many scaffolds with 100% identity that do not align to the reference at all (or align with millions of SNPs).
I have already considered keeping the scaffolds with the highest COV Q and COV R, but these values vary with sequence length and do not tell me how many SNPs are in the sequence.
Is there a more effective way to filter which scaffolds truly match my reference?
Here is a sample of what my show-coords output file looks like for 100% Identity:
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [COV R] [COV Q] [TAGS]
13898 13988 57004 57094 91 91 100 35452 107079 0.26 0.08 gi|194246388:49748443-49783894 scaffold11691
13898 13964 949 1015 67 67 100 35452 36637 0.19 0.18 gi|194246388:49748443-49783894 scaffold15913
13898 13964 1040 974 67 67 100 35452 2729 0.19 2.46 gi|194246388:49748443-49783894 scaffold47627
13900 13983 78963 79046 84 84 100 35452 161922 0.24 0.05 gi|194246388:49748443-49783894 scaffold12557
13900 13989 11716 11627 90 90 100 35452 30624 0.25 0.29 gi|194246388:49748443-49783894 scaffold25396
13900 13970 1089 1019 71 71 100 35452 1502 0.2 4.73 gi|194246388:49748443-49783894 scaffold27987
19908 19994 10242 10328 87 87 100 35452 17701 0.25 0.49 gi|194246388:49748443-49783894 scaffold52071
19909 19994 278272 278357 86 86 100 35452 353920 0.24 0.02 gi|194246388:49748443-49783894 scaffold1991
19910 19996 19486 19400 87 87 100 35452 81941 0.25 0.11 gi|194246388:49748443-49783894 scaffold14370
19910 19994 1805 1889 85 85 100 35452 2036 0.24 4.17 gi|194246388:49748443-49783894 scaffold46791
19911 19986 510 585 76 76 100 35452 1364 0.21 5.57 gi|194246388:49748443-49783894 scaffold84138
19912 19997 9499 9414 86 86 100 35452 61074 0.24 0.14 gi|194246388:49748443-49783894 scaffold44157
19912 19997 8587 8502 86 86 100 35452 15813 0.24 0.54 gi|194246388:49748443-49783894 scaffold9318
19922 19998 939 863 77 77 100 35452 1465 0.22 5.26 gi|194246388:49748443-49783894 scaffold35518
19928 19999 1018 1089 72 72 100 35452 1502 0.2 4.79 gi|194246388:49748443-49783894 scaffold27987
19929 19995 23559 23493 67 67 100 35452 28327 0.19 0.24 gi|194246388:49748443-49783894 scaffold27519
19932 19997 344 409 66 66 100 35452 5264 0.19 1.25 gi|194246388:49748443-49783894 scaffold13914
19935 20000 974 1039 66 66 100 35452 2729 0.19 2.42 gi|194246388:49748443-49783894 scaffold47627
Here is the code I use to retrieve this information:
(Species A: reference genome is known // Species B: no reference genome yet, only scaffolds are available).
$ nucmer --prefix=ref_qry Gene_Specie_A.fasta Specie_B.fa
$ show-coords -rclT ref_qry.delta > ref_qry.coords
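One possible way to rank the scaffolds in ref_qry.coords is to merge all alignments of each scaffold on the reference and compute how much of the gene they cover together. In the sample above every hit is only 66 to 91 bp long, so a single 100% identity row covers about 0.2% of the 35452 bp reference. The Python below is only a sketch under the assumption that each data row has the 13 whitespace-separated columns shown in the sample (the two tags being the last two); the function name covered_fraction_per_scaffold and its min_idy parameter are made up for illustration and are not part of MUMmer. It does not count SNPs.

```python
from collections import defaultdict


def covered_fraction_per_scaffold(lines, min_idy=0.0):
    """Return {scaffold: fraction of the reference covered by its alignments}.

    Assumes rows shaped like `show-coords -rclT` output: 13 columns, with
    S1/E1 in columns 1-2, % IDY in column 7, LEN R in column 8 and the
    query tag (scaffold name) in column 13.
    """
    intervals = defaultdict(list)
    ref_len = {}
    for line in lines:
        fields = line.split()
        # Skip header lines and anything that is not a 13-column data row.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        if float(fields[6]) < min_idy:
            continue
        s1, e1 = int(fields[0]), int(fields[1])
        scaffold = fields[12]
        ref_len[scaffold] = int(fields[7])
        intervals[scaffold].append((min(s1, e1), max(s1, e1)))

    fractions = {}
    for scaffold, ivs in intervals.items():
        ivs.sort()
        covered = 0
        cur_start, cur_end = ivs[0]
        for start, end in ivs[1:]:
            if start <= cur_end + 1:
                # Overlapping or adjacent on the reference: extend the block.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        fractions[scaffold] = covered / ref_len[scaffold]
    return fractions


# Three rows copied from the sample output above.
SAMPLE = [
    "13898 13988 57004 57094 91 91 100 35452 107079 0.26 0.08 gi|194246388:49748443-49783894 scaffold11691",
    "13900 13970 1089 1019 71 71 100 35452 1502 0.2 4.73 gi|194246388:49748443-49783894 scaffold27987",
    "19928 19999 1018 1089 72 72 100 35452 1502 0.2 4.79 gi|194246388:49748443-49783894 scaffold27987",
]

# Print scaffolds from the largest to the smallest covered fraction.
for name, frac in sorted(covered_fraction_per_scaffold(SAMPLE).items(),
                         key=lambda item: item[1], reverse=True):
    print(f"{name}\t{frac:.4f}")
```

To run it on the real file, pass the open file handle of ref_qry.coords instead of SAMPLE. A scaffold that truly carries the gene would be expected to cover a large fraction of the reference, whereas scaffolds that only share a short repeat stay close to zero.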
I have found a solution: https://biohpc.cornell.edu/doc/alignment_exercise2.html
I hope it is useful for somebody.