Hey, I'm a new student in bioinformatics and I'm working on this project. I want to replace some nucleotides with the missing-data character "-". Let's say I want to replace a stretch at the beginning of the sequence and a stretch at the end. How should I go about doing this in a scalable manner?
This is the code I have so far. I'm not sure how to edit these sequences. Is it better if I use a numpy array? What do I use to write
import numpy as np

# Map each sequence name to the list of sequence lines that follow its header
fasta = {}
with open('example.fasta') as file_one:
    for line in file_one:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record, dropping the leading ">"
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
        sequence = line
        fasta[active_sequence_name].append(sequence)
seqMat = np.array(fasta)
output: {'seq1': ['AAATATATATATATATATTATATATTATATATATTATATATATAT'], 'seq2': ['GCGCGAGATAGGGCGCGCGCGCGCGATTAGCGAGGCGCGCGCGGC'], 'seq3': ['TCTCTCTCTCTCTCTTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC']}
And this is what I have as an array. What is the best way to replace nucleotides?
from Bio import SeqIO
import numpy as np

allSeqs = []
# SeqIO.parse accepts a filename directly, so no separate open() call is needed
for seq_record in SeqIO.parse("example.fasta", "fasta"):
    allSeqs.append(seq_record.seq)
seqMat = np.array(allSeqs)
Output: array([['A', 'A', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T'],
       ['G', 'C', 'G', 'C', 'G', 'A', 'G', 'A', 'T', 'A', 'G', 'G', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'A', 'T', 'T', 'A', 'G', 'C', 'G', 'A', 'G', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'G', 'C'],
       ['T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C']], dtype='<U1')
Can you explain exactly what governs which characters are going to be replaced?
In any case, the general strategy for modifying strings is to use the replace method, but it gets more complicated if you need to specify which end to change, by how much, and so on.
There's no compelling reason to do this with numpy that I can see. As for the specific question about replacing characters in a dictionary, bear in mind that in Python everything is just an object, so that dictionary is just a wrapper around string objects, and the same approaches apply in general.
You are already using the foo[0:999] slice notation, which is really all you need to replace x characters at the start and end.
The only real complication may come if you start using Biopython, which doesn't allow you to modify sequences directly (you have to jump through a few other hoops), but based on your code you aren't doing that anyway.
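To make the slicing idea concrete, here is a minimal sketch that assumes you want to mask a fixed number of characters at each end of every sequence. The helper name mask_ends and the counts 3 and 5 are made up for illustration; the dictionary has the same shape as the one in the question.

```python
def mask_ends(seq, n_start, n_end, gap="-"):
    """Return seq with the first n_start and last n_end characters replaced by gap.

    Hypothetical helper for illustration; not part of any library.
    """
    # Clamp the counts so the two masked regions never overlap
    # or run past the ends of the sequence.
    n_start = max(0, min(n_start, len(seq)))
    n_end = max(0, min(n_end, len(seq) - n_start))
    middle = seq[n_start:len(seq) - n_end]
    return gap * n_start + middle + gap * n_end


# Same shape as the dictionary in the question: name -> list of sequence lines.
fasta = {
    "seq1": ["AAATATATATATATATATTATATATTATATATATTATATATATAT"],
    "seq2": ["GCGCGAGATAGGGCGCGCGCGCGCGATTAGCGAGGCGCGCGCGGC"],
}

# Join the line fragments of each record, then mask 3 characters
# at the start and 5 at the end (arbitrary example values).
masked = {name: mask_ends("".join(parts), 3, 5) for name, parts in fasta.items()}

# Print the result in FASTA layout.
for name, seq in masked.items():
    print(f">{name}")
    print(seq)
```

Because strings are immutable, this builds a new string instead of editing in place, which also keeps every sequence at its original length. Applying it across a whole file is just the dictionary comprehension above.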