Extracting four fold degenerate sites from nucleotide alignment
3
0
Entering edit mode
9.2 years ago
A • 0

Hi all and python users,

I am looking for a script written in python or a feature in Biopython or guidance for writing a script in python that would extract four fold degenerate sites from a nucleotide alignment file of about 90 nucleotide sequences. The output file should have four fold degenerate sites per sequence which I want to use to make a new alignment and build phylogeny.

Thinking that it would a simple thing to do, I tried to search (google) for it but very strangely, I am not finding right answer. Please help me get in the right direction.

Very many thanks,

A

alignment • 7.9k views
ADD COMMENT
2
Entering edit mode
9.2 years ago
Whetting ★ 1.6k

This should get you started. I borrowed most of the code from http://www.petercollingridge.co.uk/python-bioinformatics-tools/codon-table

bases = ['t', 'c', 'a', 'g']
codons = [a+b+c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

def translate(seq):
    seq = seq.lower().replace('\n', '').replace(' ', '')
    peptide = ''

    for I in xrange(0, len(seq), 3):
        codon = seq[i: i+3]
        amino_acid = codon_table.get(codon, '*')
        if amino_acid != '*':
            peptide += amino_acid
        else:
            break

    return peptide

codon_dict={}
for aa in set(amino_acids):
    temp=[]
    for c in codons:
       if translate(c)==aa:
           temp.append(c)
    if len(temp)==4:
        codon_dict[aa]=temp

print codon_dict

Now you should be able to run through your alignment and check whether the codon is in the codon_dict...

ADD COMMENT
0
Entering edit mode

Many thanks for getting me started with the code. I'll give it a go and will let you know if successful!!

ADD REPLY
1
Entering edit mode
9.2 years ago
Brice Sarver ★ 3.8k

I was looking for something along these lines late last year. I came up short, and (to my knowledge) there still isn't anything available. I attribute this to 1) a lack of demand and 2) a perceived difficulty in programming the task.

But, it's easy enough that I bet everyone who has written a script just didn't make it public.

If you want to use Python, simply create a dictionary based on the translation table of your organism of interest and the degeneracy (0, 1, 2, 3, 4-fold) for each position given a codon. Then, scan across your nucleotides in triplets and identify the degeneracy from the table. It's just a bit of a pain to set up because you have to type it all in, but it shouldn't take much time. This could be done in any scripting language and is basically just string processing.

Sorry for no solution directly, but I hope this points you in the right direction.

ADD COMMENT
1
Entering edit mode
9.2 years ago
h.mon 35k

There is a script in Perl provided by thackl here. However, as I commented before, if you have a large number of ambiguous sites, the number of sequences and file size may grow very quickly.

ADD COMMENT

Login before adding your answer.

Traffic: 2674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6