Biopython - new MSA from specific columns of old MSA
1
3
Entering edit mode
4.2 years ago

I want to start by apologizing about the vague title and my overall ignorance. I have no idea what I'm doing, and the few scripts I've gotten to work have been pieced together from scraps of code taken from this and other websites rather than genuinely written by me. I'm working in Python using some biopython code, and that's the limit of my coding comfort zone so far.

I'd like to take an MSA and an input list of positions to create a new MSA (or other output file type) with all of the same rows but only the columns for positions specified in the list. Ideally the identifiers for the sequences in the first MSA will carry over to the output. I've had problems getting this to work:

positions = [18,20,26,33,83,86,87,88,133,517]
outputMSA = MultipleSeqAlignment([])
inputMSA = AlignIO.read("input.fst", "fasta")
for y in inputMSA:
    for x in positions:
        outputMSA.append(inputMSA[y,x-1])
print(outputMSA)

I get "TypeError: list indices must be integers or slices, not SeqRecord"

Assuming I could get this to work, the next step would be to take a reference sequence for those positions and the output MSA, and list all of the unique sequences in the output MSA, the frequency of each one, and the frequency of the reference sequence.

Example inputs/outputs:

positions = [3, 8, 9, 11]
input MSA:
IAMAWFLATTHIS
IAMAWFLATTHIS
IAMAWFLATTHIS
IA-AWFLATTHIS

output MSA:
MATH
MATH
MATH
-ATH

referenceseq = MASH

output analysis:
MASH 0%,
MATH 75%,
-ATH 25%

UPDATE: My probably inelegant solution:

inputMSA = AlignIO.read("input.fst", "fasta")
tempMSA = inputMSA[:,0:0]
for x in positions:
    outputMSA = tempMSA[:,:] + inputMSA[:, (x-1):x]
    tempMSA = outputMSA
print(outputMSA)

Still working on the second part.

alignment • 1.4k views
ADD COMMENT
0
Entering edit mode

Please add updates as comments instead of editing the post.

ADD REPLY
0
Entering edit mode

UPDATE: My probably inelegant solution:

inputMSA = AlignIO.read("input.fst", "fasta")
tempMSA = inputMSA[:,0:0]
for x in positions:
    outputMSA = tempMSA[:,:] + inputMSA[:, (x-1):x]
    tempMSA = outputMSA
print(outputMSA)

Still working on the second part. I thought this would be the easy part, but I'm stuck again.

ADD REPLY
0
Entering edit mode
3.0 years ago

A little late on this, but here is a different solution that should be more efficient to create a new MSA specifying specific positions. To summarize, you want to convert the MultipleSeqAlignment object to a 2D numpy array where each element is a nucleotide. Once you do that you can select the exact loci you want to include and then convert the numpy array back into a MultipleSeqAlignment object.

from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
positions = [18,20,26,33,83,86,87,88,133,517]
inputMSA = AlignIO.read("input.fst", "fasta")
msa_array = np.array([list(rec) for rec in inputMSA], dtype=str)

align_contig_ids = [rec.id for rec in align]
new_numpy_array = msa_array[:,positions]

seq_record_list = []
counter = 0
for x in new_numpy_array:
    dna_str = "".join(x)
    record_temp = SeqRecord(Seq(dna_str),id=align_contig_ids[counter],description="")
    seq_record_list.append(record_temp)
    counter = counter + 1

filtered_msa = MultipleSeqAlignment(seq_record_list)
ADD COMMENT

Login before adding your answer.

Traffic: 1116 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6