overlapping nucleotides in sequence alignment
2
0
Entering edit mode
8.7 years ago

Hi,

Is there a way to calculate the overlapping nucleotides numbers between every two DNA sequences by doing multiple sequence alignment? I would like to get a matrix output showing the overlapping nucleotide numbers. Thank you!

alignment • 4.4k views
ADD COMMENT
1
Entering edit mode

Please clearify what you are meaning by 'overlap'. The term overlap is usually used to indicate the relative position of two contiguous stretches of sequence, for example the overlap between two annotated features. You may want to calculate the number of identical positions of two sequences in a multiple sequence alignment. Nearly every phylogenetical program will do this. This task is also a nice exercise if you are learning programming.

ADD REPLY
0
Entering edit mode
8.6 years ago

Use Genedoc program that runs under Windows. It is extremely flexible and configurable, to that extend, that you will need to go to the tutorial to learn from it. But it will give you that matrix you are asking for

ADD COMMENT
0
Entering edit mode
8.6 years ago
chen ★ 2.5k

If you know Julia, you can use OpenGene(https://github.com/OpenGene/OpenGene.jl) to do this

using OpenGene, OpenGene.Algorithm

r1=dna("CAGCTTGAGGCTGAAGTTCTCCTTCTTCAGGTCATTGAGGTGCTGGGACAGAGTGCGATATCCATTAGACATGATGGGCAACCCATGGGGAGGAGCGTGCCCGATTGCCCCCTCAACCCAGGAACATGGTGCAGCTGCACCGCA")
r2=dna("AGCCGGCTGGCTTCATGCTGCGGTGCAGCTGCACCATGTTCCTGGGTTGAGGGGGCAATCGGGCACGCTCCTCCCCATGGGTTGCCCATCATGTCTAATGGATATCGCACTCTGTCCCAGCACCTCAATGACCTGAAGAAGGA")
offset, overlap_len, distance = overlap(r1, r2)
# it returns (19,125,0), which means 125 base pairs are overlapped

Note that this overlap function takes r1 as forward strand and r2 as reverse complement strand, which is the output of pair-end sequencing.

If you want to implement this by yourself, you can use editdistance method to do this, here I give u a piece of python code, which is used in AfterQC(https://github.com/OpenGene/AfterQC):

COMP = {"A" : "T", "T" : "A", "C" : "G", "G" : "C", "a" : "t", "t" : "a", "c" : "g", "g" : "c", "N":"N"}

def complement(base):
    return COMP[base]

def reverseComplement(origin):
    length = len(origin)
    revCompArr = ['' for x in xrange(length)]
    for i in xrange(length):
        orig = origin[length - i -1]
        if orig in COMP:
            revCompArr[i] = COMP[orig]
        else:
            revCompArr[i] = 'N'
    return ''.join(revCompArr)

def overlap(r1, r2):
    len1 = len(r1)
    len2 = len(r2)
    reverse_r2 = reverseComplement(r2)

    overlapped = False
    overlap_len = 0
    offset = 0
    distance = 0
    # a match of less than 10 is considered as unconfident
    for offset in xrange(0,len1-10):
        # the overlap length of r1 & r2 when r2 is move right for offset
        overlap_len = min(len1-offset, len2)

        # remind that Julia is a 1-based coordination system
        distance = editDistance(r1[offset : offset+overlap_len], reverse_r2[0 : overlap_len])
        if distance <= distance_threshold(overlap_len):
            # now we find a good candidate
            # we verify it by moving r2 one more base to see if the distance is getting longer
            # if yes, then current is the best match, otherwise it's not
            next = offset + 1
            next_overlap_len = min(len1-next, len2)
            distance2 = editDistance(r1[next : next+next_overlap_len], reverse_r2[0 : next_overlap_len])
            if distance <= distance2:
                overlapped = True
                break

    if overlapped and offset == 0:
        # check if distance can get smaller if offset goes negative
        # this only happens when insert DNA is shorter than sequencing read length, and some adapter/primer is sequenced but not trimmed cleanly
        # we go reversely
        for offset in xrange(0,-(len2-10), -1):
            # the overlap length of r1 & r2 when r2 is move right for offset
            overlap_len = min(len1,  len2- abs(offset))
            distance = editDistance(r1[0:overlap_len], reverse_r2[-offset : -offset + overlap_len])
            if distance <= distance_threshold(overlap_len):
                next = offset - 1
                next_overlap_len = min(len1,  len2- abs(next))
                distance2 = editDistance(r1[0:next_overlap_len], reverse_r2[-next : -next + next_overlap_len])
                if distance <= distance2:
                    return (offset, overlap_len, distance)
    elif overlapped:
        return (offset, overlap_len, distance)

    return (0,0,0)
ADD COMMENT

Login before adding your answer.

Traffic: 2736 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6