I am writing a program to build a dataset and later train a model. This is my script so far. For each nucleotide sequence, previously split into codons (codon_list), I look up the position of a codon in a reference table.
This is the script:
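from pathlib import Path
import itertools
import pandas as pd  # needed for pd.DataFrame below

def split(seq, num):
    # split seq into consecutive chunks of length num
    return [seq[start:start+num] for start in range(0, len(seq), num)]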
with open("/home/darteagam/diploma/bert/files/bert_aa_example.txt", "r") as f1:
    list_aa = []
    for aa in f1:
        prot_seq = list(aa)
        lp = len(prot_seq)
        position_aa = prot_seq[30:31]
        list_aa.append(position_aa)
    #print(list_aa)
with open("/home/darteagam/diploma/bert/files/bert_nn_example.txt", "r") as f2:
    codon_list = []
    for nuc_seq in f2:
        #print(nuc_seq)
        x=3
        spl=[nuc_seq[y-x:y] for y in range(x, len(nuc_seq)+x,x)] # split the nn sequences in codons
        #print(spl)
        codon_list.append(spl[30:31]) # appending the 31st codon to the list
    #print(codon_list)
data_f = pd.DataFrame(
{'position':
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11,
11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23,
23 : 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32, 32: 33, 33: 34, 34: 35,
35: 36, 36: 37, 37: 38, 38: 39, 39: 40, 40: 41, 41: 42, 42: 43, 43: 44, 44: 45, 45: 46, 46: 47, 47: 48,
48: 49, 49: 50, 50: 51, 51: 52, 52: 53, 53: 54, 54: 55, 55: 56, 56: 57, 57: 58, 58: 59, 59: 60, 60: 61},
'codon':
{0:"GCT",1:"GCC",2:"GCA",3:"GCG",4:"CGT",5:"CGC",6:"CGA",7:"CGG",8:"AGA",9:"AGG",10:"AAT",
11:"AAC",12:"GAT",13:"GAC",14:"TGT",15:"TGC",16:"CAA",17:"CAG",18:"GAA",19:"GAG",20:"GGT",
21:"GGC",22:"GGA",23:"GGG",24:"CAT",25:"CAC",26:"ATT",27:"ATC",28:"ATA",29:"TTA",30:"TTG",
31:"CTT",32:"CTC",33:"CTA",34:"CTG",35:"AAA",36:"AAG",37: "ATG",38:"TTT",39:"TTC",40:"CCT",
41:"CCC",42:"CCA",43:"CCG", 44:"TCT",45:"TCC",46:"TCA",47:"TCG",48:"AGT",49:"AGC",50:"ACT",
51:"ACC",52:"ACA",53:"ACG",54:"TGG",55:"TAT",56:"TAC",57:"GTT",58:"GTC",59:"GTA",60:"GTG"},
'aminoacid':
{0:'A', 1:'A', 2:'A', 3:'A', 4:'R', 5:"R", 6:"R",7:"R",8:"R",9:"R",10:"N",11:"N", 12:"D",13:"D",
14:"C", 15:"C",16:"Q",17:"Q",18:"E",19:"E",20:"G",21:"G",22:"G",23:"G",24:"H",25:"H",26:"I",
27:"I",28:"I",29:"L",30:"L",31:"L",32:"L",33:"L",34:"L",35:"K",36:"K",37:"M",38:"F",39:"F",40:"P",
41:"P",42:"P",43:"P",44:"S",45:"S",46:"S",47:"S",48:"S",49:"S",50:"T",51:"T",52:"T",53:"T",54:"W",
55:"Y",56:"Y",57:"V",58:"V",59:"V",60:"V"}})
#print(data_f)
new_codonlist = list(itertools.chain(*codon_list))
print(new_codonlist)
for c in new_codonlist:
    where = (data_f['codon'] == c)
    if where.any():
        pos = data_f.at[where.idxmax(), 'position']
        #print(pos)
        print(f"codon {c}: position {pos}")
    else:
        print(f"codon {c} not found")
And the output is:
codon ATC: position 28
codon AAC: position 12
codon ACC: position 52
codon TTT: position 39
codon GTC: position 59
codon CTC: position 33
However, the next step is to use these positions to create a vector of 0s and 1s of length 61, where 1 marks the position of the codon and all other positions are 0. Like this:
0000000000000000000000000001000000000000000000000000000000000
with 1 in position 28
0000000000010000000000000000000000000000000000000000000000000
with 1 in the position 12
And so on for all codons and sequences.
How could I get such vector?
I was thinking that maybe it's possible to create an initial vector, like this:
initial_v = ("0000000000000000000000000000000000000000000000000000000000000")
And then find the positions of the codons and replace the 0 with 1 at the corresponding position of the vector. But this is the part I don't know how to approach. Please advise how to do it.
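A minimal sketch of that idea, assuming the positions are 1-based (as in the position column) and the vector length is fixed at 61. The helper name one_hot and the list positions are hypothetical; the values are just the positions from the output above.

```python
def one_hot(pos, length=61):
    """Return a 0/1 string of `length` characters with a single 1
    at the 1-based position `pos` (hypothetical helper)."""
    if not 1 <= pos <= length:
        raise ValueError(f"position {pos} is outside 1..{length}")
    bits = ["0"] * length      # start from an all-zero vector
    bits[pos - 1] = "1"        # convert 1-based position to 0-based index
    return "".join(bits)


# Positions taken from the output shown in the question (assumed input).
positions = [28, 12, 52, 39, 59, 33]
vectors = [one_hot(p) for p in positions]

for p, v in zip(positions, vectors):
    print(f"position {p}: {v}")
```

Strings are immutable in Python, so the sketch builds a list of characters and joins it instead of editing initial_v in place. If the vectors are meant as model input, a numeric form such as [int(ch) for ch in v] may be more convenient than a string.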
Omg! That was really easy! Thank you Shred