Get a vector of 0 and 1 using codon positions
2.0 years ago
dbykov • 0

I am writing a program to build a dataset that I will later use to train a model. Here is what I have so far. The script takes codons from sequences that were previously split into codons (codon_list) and looks up the position of each codon in a reference table.

This is the script.

from pathlib import Path
import itertools

import pandas as pd  # needed for pd.DataFrame below

def split(str, num):
    return [ str[start:start+num] for start in range(0, len(str), num) ]

with open("/home/darteagam/diploma/bert/files/bert_aa_example.txt", "r") as f1:
    list_aa = []
    for aa in f1:
        prot_seq = list(aa)
        lp = len(prot_seq)
        position_aa = prot_seq[30:31]
        list_aa.append(position_aa)
#print(list_aa)

with open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:  
    codon_list = []
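    for nuc_seq in f2:
        nuc_seq = nuc_seq.strip()  # drop the trailing newline so it cannot end up inside a codon
        #print(nuc_seq)
        x = 3  # codon length
        spl = [nuc_seq[y-x:y] for y in range(x, len(nuc_seq)+x, x)]  # split the nucleotide sequence into codons
        #print(spl)
        codon_list.append(spl[30:31])  # append the 31st codon (index 30) to the list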
#print(codon_list)

data_f = pd.DataFrame(
{'position':
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11,
11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 
23 : 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32, 32: 33, 33: 34, 34: 35,
35: 36, 36: 37, 37: 38, 38: 39, 39: 40, 40: 41, 41: 42, 42: 43, 43: 44, 44: 45, 45: 46, 46: 47, 47: 48,
48: 49, 49: 50, 50: 51, 51: 52, 52: 53, 53: 54, 54: 55, 55: 56, 56: 57, 57: 58, 58: 59, 59: 60, 60: 61}, 
'codon': 
{0:"GCT",1:"GCC",2:"GCA",3:"GCG",4:"CGT",5:"CGC",6:"CGA",7:"CGG",8:"AGA",9:"AGG",10:"AAT",
11:"AAC",12:"GAT",13:"GAC",14:"TGT",15:"TGC",16:"CAA",17:"CAG",18:"GAA",19:"GAG",20:"GGT",
21:"GGC",22:"GGA",23:"GGG",24:"CAT",25:"CAC",26:"ATT",27:"ATC",28:"ATA",29:"TTA",30:"TTG",
31:"CTT",32:"CTC",33:"CTA",34:"CTG",35:"AAA",36:"AAG",37: "ATG",38:"TTT",39:"TTC",40:"CCT",
41:"CCC",42:"CCA",43:"CCG", 44:"TCT",45:"TCC",46:"TCA",47:"TCG",48:"AGT",49:"AGC",50:"ACT",
51:"ACC",52:"ACA",53:"ACG",54:"TGG",55:"TAT",56:"TAC",57:"GTT",58:"GTC",59:"GTA",60:"GTG"}, 
'aminoacid': 
{0:'A', 1:'A', 2:'A', 3:'A', 4:'R', 5:"R", 6:"R",7:"R",8:"R",9:"R",10:"N",11:"N", 12:"D",13:"D",
14:"C", 15:"C",16:"Q",17:"Q",18:"E",19:"E",20:"G",21:"G",22:"G",23:"G",24:"H",25:"H",26:"I",
27:"I",28:"I",29:"L",30:"L",31:"L",32:"L",33:"L",34:"L",35:"K",36:"K",37:"M",38:"F",39:"F",40:"P",
41:"P",42:"P",43:"P",44:"S",45:"S",46:"S",47:"S",48:"S",49:"S",50:"T",51:"T",52:"T",53:"T",54:"W",
55:"Y",56:"Y",57:"V",58:"V",59:"V",60:"V"}})

#print(data_f)
new_codonlist = list(itertools.chain(*codon_list))
print(new_codonlist)
for c in new_codonlist:
    where = (data_f['codon'] == c)
    if where.any():
        pos = data_f.at[where.idxmax(), 'position']
        #print(pos)
        print(f"codon {c}: position {pos}")
    else:
        print(f"codon {c} not found")

And the output is:

codon ATC: position 28
codon AAC: position 12
codon ACC: position 52
codon TTT: position 39
codon GTC: position 59
codon CTC: position 33

The next step is to turn each position into a vector of 0s and 1s of length 61, with 1 at the codon's position and 0 everywhere else. Like this:

0000000000000000000000000001000000000000000000000000000000000
with 1 at position 28

0000000000010000000000000000000000000000000000000000000000000
with 1 at position 12

And so on for all the codons and sequences.

How could I get such a vector?

I was thinking that maybe it's possible to create an initial vector, like this:

initial_v = "0" * 61  # a string of 61 zeros

And then find the positions of the codons and replace the 0 with 1 at the corresponding position of the vector. This is the part I don't know how to approach. Please advise.
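
Python strings are immutable, so a character cannot be replaced in place. One way to follow the idea described above is to start from a list of zeros, set a single element, and join the result. This is only a minimal sketch: `make_vector` is a hypothetical helper name, and positions are assumed to be 1-based, as in the `position` column of the table.

```python
def make_vector(pos, length=61):
    """Build a 0/1 string with a single 1 at the 1-based position `pos`."""
    if not 1 <= pos <= length:
        raise ValueError(f"position {pos} is outside 1..{length}")
    cells = ["0"] * length   # mutable stand-in for the all-zero initial vector
    cells[pos - 1] = "1"     # convert the 1-based position to a 0-based index
    return "".join(cells)


vector_28 = make_vector(28)
print(vector_28)
```

The range check matters because a position of 0 would otherwise silently set the last element through Python's negative indexing.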

dna codon python pandas • 639 views
2.0 years ago
Shred ★ 1.5k

This is straightforward. Suppose your position is 28 and the vector length is 61:

pos = 28
vector = '0'*(pos-1) + '1' + '0'*(61-pos)
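
The same expression can be wrapped in a small function and applied to every position found by the lookup loop. The following is a rough sketch rather than part of the answer above: the name `one_hot_string` and the default `length=61` are assumptions for illustration, and the positions are copied from the output shown in the question.

```python
def one_hot_string(pos, length=61):
    """Return a string of `length` characters with '1' at 1-based `pos`."""
    if not 1 <= pos <= length:
        raise ValueError(f"position {pos} is outside 1..{length}")
    return "0" * (pos - 1) + "1" + "0" * (length - pos)


# Positions taken from the output printed in the question.
positions = [28, 12, 52, 39, 59, 33]
vectors = [one_hot_string(p) for p in positions]

for p, v in zip(positions, vectors):
    print(p, v)
```

If the model needs numbers rather than text, `[int(ch) for ch in v]` turns each string into a list of integers.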
Omg! That was really easy! Thank you Shred
