Question

How to output a new alignment containing only sequences with a particular residue in a specified column

1

10.8 years ago

julestrachsel ▴ 20

Hello!

I am very new to Biopython and I am trying to accomplish what I think is a simple task: I would like to remove sequences from a protein alignment that do not contain a particular residue at a specified position. I would like to input a protein alignment in FASTA format and output a new alignment from which all the sequences that do not meet my criteria have been removed.

For example: My input protein alignment contains sequences that have a mixture of residues at position 137. I would like to output a new alignment that contains only sequences that have either an arginine or a valine at position 137.

Just a bit of additional clarification: I am sequencing an amplicon of a functional gene and generating protein sequence alignments using RDP's FunGene pipeline. I want to further screen the alignment by eliminating any sequences that do not contain one of a selection of conserved residues at various positions.

Thank you very much for your time.

-J

biopython alignment • 5.5k views

ADD COMMENT • link updated 4.3 years ago by Ram 45k • written 10.8 years ago by julestrachsel ▴ 20

Ram · Accepted Answer · 2014-10-22

4

10.8 years ago

Bioinformatics_NewComer ▴ 330

python script.py <file.fasta>

1. Read in the sequences with SeqIO (from Bio import SeqIO), store them in a dictionary, and write them to a file.
2. Align the sequence file with the tool of your choice.
3. Read in the alignment with AlignIO (from Bio import AlignIO).
4. Iterate through the alignment and check the residues at the positions you are interested in.
5. Make a list of the sequences to keep or discard, then delete the old alignment file.
6. Write out the remaining sequences and redo the alignment.

Hope I haven't confused it.
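The "iterate through the alignment and check the residues" part can be sketched roughly as below. This is only a minimal illustration, not a tested pipeline: keep_by_column and its parameters are names made up for this example, and a toy in-memory alignment stands in for one read with AlignIO.read. Python indexing is 0-based, so the function subtracts 1 from the 1-based column number.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def keep_by_column(alignment, column, allowed):
    """Return a new alignment holding only the records whose residue at the
    1-based alignment column `column` is one of `allowed`.

    Hypothetical helper for illustration; not part of Biopython.
    """
    index = column - 1  # convert the 1-based column to a 0-based index
    kept = [record for record in alignment if record.seq[index] in allowed]
    return MultipleSeqAlignment(kept)


# Toy alignment (assumption: stands in for AlignIO.read("file.fasta", "fasta")).
toy = MultipleSeqAlignment([
    SeqRecord(Seq("MKRV"), id="seq1"),
    SeqRecord(Seq("MKVV"), id="seq2"),
    SeqRecord(Seq("MK-V"), id="seq3"),
])

# Keep only sequences with R or V in column 3; seq3 has a gap there.
filtered = keep_by_column(toy, column=3, allowed={"R", "V"})
print([record.id for record in filtered])
```

The filtered object is an ordinary MultipleSeqAlignment, so it could be passed to AlignIO.write. Keep in mind that the column number refers to the gapped alignment, not to the residue number in any single ungapped sequence.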

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 10.8 years ago by Bioinformatics_NewComer ▴ 330

0

Thank you so much for the answer!

It was just the advice I needed.

This is the Python script I wrote to accomplish my goal (it probably looks really ugly to anyone with actual Python experience):

from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment

alignment = AlignIO.read("MyProt.fasta", "fasta")     # my input alignment

goodseqs1 = MultipleSeqAlignment([])                  # empty MSA for seqs that pass the first screen
goodseqs2 = MultipleSeqAlignment([])                  # empty MSA for seqs that pass both screens
badseqs = MultipleSeqAlignment([])                    # MSA for seqs that fail either screen

# Note: Python indices are 0-based, so index 48 is alignment column 49
# and index 46 is alignment column 47 when counting from 1.
for sequence in alignment:
    if sequence.seq[48] in ("F", "Y"):                # first screen: "F" or "Y" at index 48
        goodseqs1.append(sequence)
    else:
        badseqs.append(sequence)                      # puts all remaining seqs in badseqs

for sequence in goodseqs1:                            # second screen, only for seqs that passed the first round
    if sequence.seq[46] in ("Q", "R", "G", "I"):      # "Q", "R", "G" or "I" at index 46
        goodseqs2.append(sequence)
    else:
        badseqs.append(sequence)

AlignIO.write(goodseqs2, "SCREENED_SEQS.FASTA", "fasta")            # writes a fasta alignment containing only passing seqs
AlignIO.write(badseqs, "Discard.fasta", "fasta")                    # writes a fasta alignment containing only failing seqs

print("Alignment length %i" % alignment.get_alignment_length())     # prints the length of the alignment
print("initial # of seqs", len(alignment))                          # prints the # of seqs in the initial alignment
print("seqs passing first screen:", len(goodseqs1))                 # prints the # of seqs passing the first screen
print("final # of seqs", len(goodseqs2))                            # prints the # of seqs passing both screens
print("output files are: SCREENED_SEQS.FASTA and Discard.fasta")    # prints the names of the output files
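If more positions get added later, the per-position loops in the script above could be collapsed into a single pass driven by a lookup table. The following is just a sketch of that idea: CRITERIA, passes and partition are hypothetical names, the indices and residues are copied from the script above, and plain strings stand in for the aligned sequences so the example has no dependencies.

```python
# Hypothetical criteria table: 0-based index -> residues allowed at that index.
# The values mirror the two screens in the script above.
CRITERIA = {
    48: {"F", "Y"},
    46: {"Q", "R", "G", "I"},
}


def passes(seq, criteria):
    """Return True if `seq` has an allowed residue at every listed index."""
    return all(seq[index] in allowed for index, allowed in criteria.items())


def partition(named_seqs, criteria):
    """Split (name, sequence) pairs into (passing, failing) lists."""
    good, bad = [], []
    for name, seq in named_seqs:
        (good if passes(seq, criteria) else bad).append((name, seq))
    return good, bad


# Toy sequences, 49 characters long so that indices 46 and 48 exist.
toy = [
    ("a", "A" * 46 + "QAF"),  # Q at 46, F at 48: passes both screens
    ("b", "A" * 46 + "QAW"),  # W at 48: fails the first screen
    ("c", "A" * 46 + "AAY"),  # A at 46: fails the second screen
]

good, bad = partition(toy, CRITERIA)
print([name for name, _ in good])
print([name for name, _ in bad])
```

With a Biopython alignment, the same functions should accept pairs such as (record.id, record.seq), since indexing a Seq with an integer returns a single character. Unlike the two-loop script, this version does not report how many sequences passed the first screen alone.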

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 10.8 years ago by julestrachsel ▴ 20

Ram · Accepted Answer · 2014-10-23
