Question

[Newbie question] Best way to replace nucleotides in fasta files, numpy array or dictionaries?

4.7 years ago

Baeb • 0

Hey, I'm a new student in bioinformatics and I'm working on a project where I want to replace some nucleotides with a missing-data character, "-". Let's say I want to replace a bit at the beginning of the sequence and a bit at the end. How should I go about doing this in a scalable manner?

This is the code I have so far. I'm not sure how to edit these sequences. Is it better to use a numpy array? What do I use to write

import numpy as np

fasta = {}

# Build {sequence name: [sequence lines]} from the FASTA file
with open('example.fasta') as file_one:
    for line in file_one:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record
            active_sequence_name = line[1:]
            if active_sequence_name not in fasta:
                fasta[active_sequence_name] = []
            continue
        sequence = line
        fasta[active_sequence_name].append(sequence)

seqMat = np.array(fasta)

output: {'seq1': ['AAATATATATATATATATTATATATTATATATATTATATATATAT'], 'seq2': ['GCGCGAGATAGGGCGCGCGCGCGCGATTAGCGAGGCGCGCGCGGC'], 'seq3': ['TCTCTCTCTCTCTCTTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC']}

And this is what I have as an array. What is the best way to replace nucleotides?

from Bio import SeqIO
import numpy as np

allSeqs = []
with open("example.fasta") as handle:
    for seq_record in SeqIO.parse(handle, "fasta"):
        # One list of single characters per sequence
        allSeqs.append(list(str(seq_record.seq)))
seqMat = np.array(allSeqs)

Output: array([['A', 'A', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T', 'A', 'T'], ['G', 'C', 'G', 'C', 'G', 'A', 'G', 'A', 'T', 'A', 'G', 'G', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'A', 'T', 'T', 'A', 'G', 'C', 'G', 'A', 'G', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'C', 'G', 'G', 'C'], ['T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C', 'T', 'C']], dtype='<u1')< p="">

fasta python array numpy dictionary • 1.5k views

Can you explain exactly what governs which characters are going to be replaced?

In any case, the general strategy for modifying strings is to use the replace method, but it can get more complicated if you need to specify which end and by how much etc.
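To illustrate why replace alone gets awkward for this, here is a small sketch (the sequence is a shortened, made-up example, not the poster's data). `str.replace` matches on content, not on position, so it cannot by itself target "the first few bases":

```python
seq = "AAATATATAT"

# str.replace is content-based: it swaps every matching substring,
# wherever it occurs in the string.
all_a_masked = seq.replace("A", "-")

# The optional third argument only caps how many matches are replaced,
# counting from the left; it still does not select positions.
first_two_a_masked = seq.replace("A", "-", 2)

print(all_a_masked)
print(first_two_a_masked)
```

If the rule is positional ("the first N and last M bases"), slicing is usually the simpler tool.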

I can't see a compelling reason to do this with numpy. As for replacing characters in a dictionary, bear in mind that in Python everything is just an object: your dictionary simply maps each name to a list of string objects, so the same string operations apply once you pull the string out.

You are already using the foo[0:999] notation, which is really all you need to replace x chars at the start and end.
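A minimal sketch of that slicing approach, assuming the dictionary layout from the question (name mapped to a list of sequence lines). The helper names `mask_ends` and `to_fasta`, the sample sequences, and the amounts masked are all made up for illustration:

```python
def mask_ends(seq, n_start, n_end, gap="-"):
    """Return seq with the first n_start and last n_end characters set to gap."""
    # Clamp the counts so short sequences are fully masked rather than lengthened.
    n_start = max(0, min(n_start, len(seq)))
    n_end = max(0, min(n_end, len(seq) - n_start))
    middle = seq[n_start:len(seq) - n_end]
    return gap * n_start + middle + gap * n_end


def to_fasta(records):
    """Format a {name: sequence} dict as FASTA text, one line per sequence."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Same shape as the dictionary built in the question, with shortened sequences.
fasta = {"seq1": ["AAATATATAT"], "seq2": ["GCGCGAGATA"]}

# Join the per-line chunks, then mask 3 bases at the start and 2 at the end.
masked = {name: mask_ends("".join(chunks), 3, 2) for name, chunks in fasta.items()}

print(to_fasta(masked))
```

Strings are immutable, so each call builds a new string, which is fine for typical FASTA files. The text returned by `to_fasta` can be written out with an ordinary `open(path, "w")` handle.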

The only real complication may come if you start using Biopython, whose Seq objects can't be modified in place, so you have to jump through a few other hoops. Based on your first snippet, though, you aren't doing that anyway.

ADD REPLY • link 4.7 years ago by Joe 22k