How to output a new alignment containing only sequences with a particular residue in a specified column
2
1
Entering edit mode
10.2 years ago

Hello!

I am very new to biopython and I am trying to accomplish what I think is a simple task: I would like to remove sequences from a protein alignment that do not contain a particular residue at a specified position. I would like to be able to input a protein alignment in fasta format and then output a new alignment where all the sequences that do not meet my criteria are removed

For example: My input protein alignment contains sequences that have a mixture of residues at position 137. I would like to output a new alignment that contains only sequences that have either an arginine or a valine at position 137.

Just a bit of additional clarification: I am sequencing an amplicon of a functional gene and generating protein sequence alignments using RDP's fungene pipeline. I want to further screen the alignment by eliminating any sequences that do not contain a selection of conserved residues at various positions.

Thank you very much for your time.

-J

biopython alignment • 5.1k views
ADD COMMENT
4
Entering edit mode
10.2 years ago
python script.py <file.fasta>
  • read in the sequence from Bio import SeqIO and store them in a dictionary. Print Sequences in a file.
  • align the sequence file with tool of your choice.
  • read in the alignment using from Bio import AlignIO
  • iterate through the alignment, and check the residues you are interested.
  • make a list of the ones to kept or thrown. Delete your alignment file.
  • Print the remaining sequences and re-do alignment.

Hope I haven't confused it.

ADD COMMENT
0
Entering edit mode

Thank you so much for the answer!

It was just the advice I needed.

This is the python file I made to accomplish my goal (it's probably really ugly to anyone who has actual python experience)

from Bio import AlignIO

from Bio.Align import MultipleSeqAlignment
alignment = AlignIO.read("MyProt.fasta", "fasta")     # my input alignment

goodseqs1 = MultipleSeqAlignment([])                  # sets up an empty MSA that good sequences can be added to
goodseqs2 = MultipleSeqAlignment([])                  # MSA that good seqs can be added to for another round of screening
badseqs = MultipleSeqAlignment([])                    # MSA that seqs not meeting the criteria are added to
for sequence in alignment:
    if sequence.seq[48] == "F":
        goodseqs1.append(sequence)                    # adds all sequences with "F" at position 48 to goodseqs1 alignment
    elif sequence.seq[48] == "Y":
        goodseqs1.append(sequence)                    # adds all seqs with a "Y" at position 48 to goodseqs1 align
    else:
        badseqs.append(sequence)                      # puts all remaining seqs in badseqs

for sequence in goodseqs1:                            # additional round of screening for seqs that passed the first round
    if sequence.seq[46] == "Q":
        goodseqs2.append(sequence)
    elif sequence.seq[46] == "R":
        goodseqs2.append(sequence)
    elif sequence.seq[46] == "G":
        goodseqs2.append(sequence)
    elif sequence.seq[46] == "I":
        goodseqs2.append(sequence)
    else:
        badseqs.append(sequence)

AlignIO.write(goodseqs2, "SCREENED_SEQS.FASTA", "fasta")            # writes a fasta alignment containing only passing seqs
AlignIO.write(badseqs, "Discard.fasta", "fasta")                    # writes a fasta alignment containing only failing seqs

print("Alignment length %i" % alignment.get_alignment_length())     # prints the length of the alignment
print("initial # of seqs", len(alignment))                          # prints the # of seqs in the initial alignment
print("seqs passing first screen:", len(goodseqs1))                 # prints the # of seqs passing the first screen
print("final # of seqs", len(goodseqs2))                            # prints the # of seqs passing both screens
print("output files are: SCREENED_SEQS.FASTA and Discard.fasta")    # prints the names of the output files
ADD REPLY
1
Entering edit mode
10.2 years ago

Thank you so much for the answer!

It was just the advice I needed.

This is the python file I made to accomplish my goal (it's probably really ugly to anyone who has actual python experience)

from Bio import AlignIO

from Bio.Align import MultipleSeqAlignment
alignment = AlignIO.read("MyProt.fasta", "fasta")     # my input alignment

goodseqs1 = MultipleSeqAlignment([])                  # sets up an empty MSA that good sequences can be added to
goodseqs2 = MultipleSeqAlignment([])                  # MSA that good seqs can be added to for another round of screening
badseqs = MultipleSeqAlignment([])                    # MSA that seqs not meeting the criteria are added to
for sequence in alignment:
    if sequence.seq[48] == "F":
        goodseqs1.append(sequence)                    # adds all sequences with "F" at position 48 to goodseqs1 alignment
    elif sequence.seq[48] == "Y":
        goodseqs1.append(sequence)                    # adds all seqs with a "Y" at position 48 to goodseqs1 align
    else:
        badseqs.append(sequence)                      # puts all remaining seqs in badseqs

for sequence in goodseqs1:                            # additional round of screening for seqs that passed the first round
    if sequence.seq[46] == "Q":
        goodseqs2.append(sequence)
    elif sequence.seq[46] == "R":
        goodseqs2.append(sequence)
    elif sequence.seq[46] == "G":
        goodseqs2.append(sequence)
    elif sequence.seq[46] == "I":
        goodseqs2.append(sequence)
    else:
        badseqs.append(sequence)

AlignIO.write(goodseqs2, "SCREENED_SEQS.FASTA", "fasta")            # writes a fasta alignment containing only passing seqs
AlignIO.write(badseqs, "Discard.fasta", "fasta")                    # writes a fasta alignment containing only failing seqs

print("Alignment length %i" % alignment.get_alignment_length())     # prints the length of the alignment
print("initial # of seqs", len(alignment))                          # prints the # of seqs in the initial alignment
print("seqs passing first screen:", len(goodseqs1))                 # prints the # of seqs passing the first screen
print("final # of seqs", len(goodseqs2))                            # prints the # of seqs passing both screens
print("output files are: SCREENED_SEQS.FASTA and Discard.fasta")    # prints the names of the output files
ADD COMMENT
1
Entering edit mode

You can perhaps do:

for sequence in alignment:
    if sequence.seq[48] == "F" or sequence.seq[48] == "Y":
        if sequence.seq[46] == "Q" or sequence.seq[46] == "R" or  sequence.seq[46] == "G" or sequence.seq[46] == "I":
            goodseq2. append(sequence)
ADD REPLY
1
Entering edit mode

Sorry, I'm not good in having code in reply very well.

Your code will work too, but having all the checks in a single go wouldn't be a bad idea too. :)

ADD REPLY

Login before adding your answer.

Traffic: 1878 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6