Code output inconsistency
3.5 years ago
clifform • 0

Hi everyone, this is my first post here. I'm a total beginner at bioinformatics, and truthfully at coding in general. I'm taking on a little self-guided project to try to learn some stuff on my own, but I've run into a weird snag. I'm suddenly not getting the expected output despite having not changed anything about the relevant parts of the code.

Here's an example. In the script I'm writing below, I'm importing a list of random peptide "words" and then converting the sequence into every possible codon configuration regardless of usage. Let it be known that I found this from a previous post from this forum, so I don't understand everything about it. In particular, I don't really get what's happening in the block where def generator(protein): is defined. Everything else is more intuitive to me.

At any rate, I rewrote the last portion of the original code to pull the data from the list that was imported and to run the script on everything in it. I was playing around with the output by changing the first line of the last 'for loop' - i.e. I changed up wordslist to be everything from a random short AA sequence that I manually type in, to a single index on my list (e.g. wordslist[0]), to a range of indices on my list (e.g. wordslist[0:1]), and finally on the entire list. Previously, I had success with getting the expected output on all of these. For example, using 'MKS' as the input, the expected output is this:

ATGAAAAGC
ATGAAAAGT
ATGAAATCA
ATGAAATCC
ATGAAATCG
ATGAAATCT
ATGAAGAGC
ATGAAGAGT
ATGAAGTCA
ATGAAGTCC
ATGAAGTCG
ATGAAGTCT

and I consistently got that output back every time. However, now, I'm seeing that I only get the expected output when I run the script on either a range of indices or on the full list. I can no longer get the expected output when I limit the range to a single list index, or if I manually enter some random AA sequence. For example, still using 'MKS' as the input, now I see

ATG
AAA
AAG
AGC
AGT
TCA
TCC
TCG
TCT

and I'm annoyed because I don't know what changed. I was achieving the expected output before, but now I can't troubleshoot my way back to it behaving smoothly again. Can you guys give me some help to fix this?

Full code below:

### IMPORT A TXT FILE AS A LIST ###

# Read one peptide per line, dropping the trailing newline from each.
# The with-block closes the file automatically.
with open('words.txt', 'r') as words:
    wordslist = [w.strip() for w in words]

### CONVERT AA SEQUENCE TO ALL POSSIBLE CODONS ###

import itertools

d = {
    'A': ['GCA', 'GCC', 'GCG', 'GCT'],
    'C': ['TGC', 'TGT'],
    'D': ['GAC', 'GAT'],
    'E': ['GAA', 'GAG'],
    'F': ['TTC', 'TTT'],
    'G': ['GGA', 'GGC', 'GGG', 'GGT'],
    'H': ['CAC', 'CAT'],
    'I': ['ATA', 'ATC', 'ATT'],
    'K': ['AAA', 'AAG'],
    'L': ['CTA', 'CTC', 'CTG', 'CTT', 'TTA', 'TTG'],
    'M': ['ATG'],
    'N': ['AAC', 'AAT'],
    'P': ['CCA', 'CCC', 'CCG', 'CCT'],
    'Q': ['CAA', 'CAG'],
    'R': ['AGA', 'AGG', 'CGA', 'CGC', 'CGG', 'CGT'],
    'S': ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT'],
    'T': ['ACA', 'ACC', 'ACG', 'ACT'],
    'V': ['GTA', 'GTC', 'GTG', 'GTT'],
    'W': ['TGG'],
    'Y': ['TAC', 'TAT'],
}

def generator(protein):
    l = [d[aa] for aa in protein]
    for comb in itertools.product(*l):
        yield "".join(comb)

for protein_seq in wordslist[0]:
    g = generator(protein_seq)
    for dna_seq in g:
        print(dna_seq)
3.5 years ago
clifform • 0

Well this didn't get any attention, but in case anyone comes across this in the future and is interested in how this was resolved, the problem was in this section:

for protein_seq in wordslist[0]:
    g = generator(protein_seq)
    for dna_seq in g:
        print(dna_seq)

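wordslist[0] needed to be written as [wordslist[0]] for the code to run correctly. Looping directly over wordslist[0] iterates over the characters of that one string, so generator() was handed a single amino acid at a time, which is why the codons were printed one per line instead of being combined. Wrapping it in a list makes the loop see the whole peptide as one item. With this change I can also run a manual input, e.g. ['MKS'], and get the expected output of 9 bp sequences.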
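
For anyone who wants to see the difference side by side, here is a minimal sketch. It assumes a cut-down codon table (only M, K and S) and a one-item wordslist, both made up for illustration; the names codons, per_residue and per_peptide are not from the original script.

```python
import itertools

# Reduced codon table covering only the residues in the 'MKS' example
# (an assumption for brevity; the full table is in the original post).
codons = {
    'M': ['ATG'],
    'K': ['AAA', 'AAG'],
    'S': ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT'],
}

def generator(protein):
    # One list of codon choices per amino acid in the peptide.
    choices = [codons[aa] for aa in protein]
    # Every way of picking one codon from each list, joined into DNA.
    for comb in itertools.product(*choices):
        yield "".join(comb)

wordslist = ['MKS']  # hypothetical stand-in for the contents of words.txt

# Looping over a string visits its characters: 'M', then 'K', then 'S'.
per_residue = [dna for seq in wordslist[0] for dna in generator(seq)]

# Looping over a one-item list visits the whole peptide 'MKS' once.
per_peptide = [dna for seq in [wordslist[0]] for dna in generator(seq)]

print(per_residue)       # 9 separate codons, 3 nt each
print(len(per_peptide))  # 12 sequences, 9 nt each
print(per_peptide[0])    # ATGAAAAGC
```

A slice such as wordslist[0:1] works for the same reason: it returns a list containing one string rather than the string itself.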


Sorry you didn't get any traction on this. While I haven't looked at the code closely, my immediate thought when you described your problem was that you were probably running into some sort of iterator/generator behaviour, or perhaps memoization/caching.

If it helps, I suspected the issue might stem from the fact that a generator in Python is a special type of iterator which produces one result at a time. The secret sauce here is the yield statement in the function definition, as opposed to the more familiar return. Calling the function gives you a generator object, and every time you call next() on that object you get the next result rather than the same one as last time. Take a look here for more info: https://towardsdatascience.com/6-examples-to-master-python-generators-28f4c614ed45

But one of the toy examples you can try to satisfy yourself of this is to define a generator (as in that link):

# Define a generator (the use of yield is what makes it 'special')
def mygenerator(n):
    for i in range(1, n, 2):
        yield i**3

# Assign it to a new object:
gen = mygenerator(10)

# Manually step through the generator using the built-in function next()
>>> next(gen)
1
>>> next(gen)
27
>>> next(gen)
125
>>> next(gen)
343
>>> next(gen)
729
>>> next(gen)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Generators cannot be 're-iterated' without being re-made, so you will get a StopIteration exception when you try to use next() at the end of the generator.
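
To make the 're-made' part concrete, here is a small sketch reusing the same toy generator. The variable names first_pass, second_pass and fresh are just for illustration.

```python
def mygenerator(n):
    for i in range(1, n, 2):
        yield i**3

gen = mygenerator(10)

# Consuming the generator once yields every value.
first_pass = list(gen)
print(first_pass)   # [1, 27, 125, 343, 729]

# The same object is now exhausted, so a second pass yields nothing.
second_pass = list(gen)
print(second_pass)  # []

# Calling the function again builds a brand-new generator.
fresh = list(mygenerator(10))
print(fresh)        # [1, 27, 125, 343, 729]
```

Note that a for loop or list() swallows the StopIteration for you; you only see the exception when calling next() by hand.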

Hopefully you can see how here you get a different result each time you call next() on a given generator instance, and maybe this was some useful background info in any case.

Generators are a pretty advanced feature of python though, so don't feel put out if they don't seem immediately very intuitive (I still struggle with them half the time).
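
As for the generator(protein) block in the original script, the part doing the heavy lifting is itertools.product. A rough sketch of what happens for a two-residue peptide 'MK' (the names codon_lists, combos and joined are made up here; the codons are the ones from the table in the question):

```python
import itertools

# What the list comprehension builds for 'MK': one list of codons per residue.
codon_lists = [['ATG'], ['AAA', 'AAG']]

# The * unpacks the outer list, so this is product(['ATG'], ['AAA', 'AAG']).
# product() returns every way of taking one item from each list, as tuples.
combos = list(itertools.product(*codon_lists))
print(combos)  # [('ATG', 'AAA'), ('ATG', 'AAG')]

# "".join() glues each tuple of codons into a single DNA string.
joined = ["".join(c) for c in combos]
print(joined)  # ['ATGAAA', 'ATGAAG']
```

The original function does the same thing, except that it yields each joined string one at a time instead of building the whole list up front.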
