How is sequence identity affected by non-end gaps?
9.8 years ago
n00514450 ▴ 20

From what I read, the sequences

AACTG
AACTGAC

have 100% identity, while the sequences

AACTG
AACTC

have 80% identity. What about an alignment where the gaps are internal rather than at the end, such as the following?

AATCG--C
AATCGAAC
alignment • 2.9k views

I like to use (#matches)/(alignment length), where alignment length includes gaps, because it is symmetric: you get the same identity regardless of which sequence is the query and which is the reference. I count each N as 0.25 of a match.
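As a rough illustration of that formula, here is a minimal Python sketch. The function name `identity_with_gaps` is made up for this example, and treating any column that contains an N as 0.25 of a match is one reading of the rule above, not a standard definition. It assumes the two sequences are already aligned and padded with `-` to the same length.

```python
def identity_with_gaps(a: str, b: str) -> float:
    """Matches divided by alignment length, with gap columns counted in the length.

    Hypothetical helper for illustration; expects two pre-aligned sequences.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        raise ValueError("alignment is empty")

    matches = 0.0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # gap column: adds to the length but never to the matches
        if x == "N" or y == "N":
            matches += 0.25  # assumption: a column with an N scores a quarter match
        elif x == y:
            matches += 1.0
    return matches / len(a)


# The gapped alignment from the question: 6 matches over 8 columns.
print(identity_with_gaps("AATCG--C", "AATCGAAC"))  # 0.75
# Swapping the arguments gives the same value, which is the symmetry mentioned above.
print(identity_with_gaps("AATCGAAC", "AATCG--C"))  # 0.75
```

Under this definition the internal two-base gap lowers the identity from 100% to 75%, even though every aligned base agrees.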

In practice, identity is often not a good metric, because it either gives exaggeratedly low scores to sequences with long indels, or ignores indels completely, neither of which makes much sense.

9.8 years ago
Brice Sarver ★ 3.8k

There are many ways to calculate percent identity. Some take every character in a DNA string and compare it to another, whereas others explicitly exclude gapped or ambiguous sites. In your third case, the % identity depends on whether or not you include the sites that are gapped in the first sequence.
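To make the difference concrete, here is a small Python sketch that scores the same alignment under two of the possible conventions. The function `percent_identity` and its `count_gaps` flag are hypothetical names chosen for this example, not any particular tool's definition, and ambiguous bases are simply compared literally.

```python
def percent_identity(a: str, b: str, count_gaps: bool = True) -> float:
    """Percent identity of two pre-aligned sequences (illustrative sketch only).

    count_gaps=True  -> denominator is the full alignment length
    count_gaps=False -> denominator is only the columns with no gap
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")

    pairs = list(zip(a.upper(), b.upper()))
    # A column is a match when both characters agree and neither is a gap.
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    if count_gaps:
        columns = len(pairs)
    else:
        columns = sum(1 for x, y in pairs if x != "-" and y != "-")
    if columns == 0:
        raise ValueError("no columns to compare")
    return 100.0 * matches / columns


seq1, seq2 = "AATCG--C", "AATCGAAC"
print(percent_identity(seq1, seq2, count_gaps=True))   # 75.0  (6 of 8 columns)
print(percent_identity(seq1, seq2, count_gaps=False))  # 100.0 (6 of 6 ungapped columns)
```

So the third case is 75% or 100% depending on the convention, which is why it helps to state the definition alongside the number.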

For a very short yet good synopsis, I recommend this paper by Alex May, "Percent Sequence Identity: The Need to be Explicit."


Thank you. I guess the best thing to do is just ask my professor how he defines it.
