How do I remove sequences from my MSA that contain 50% gaps? I know there are various posts about removing columns with gaps. But I'm looking for a simple script to identify alignments within my MSA that have >50% gaps "-" and remove them. I am not well versed in BioPython AlignIO. I feel as though this problem is simple. Any guidance is greatly appreciated.
Unless you have a reference sequence, this is not a simple problem. Let's say you have 9 sequences that are around 100 residues each, and one that is homolgous to all of them but has an inserted domain that brings its size to 200 residues. When you align them, those 9 sequences would likely have at least 50% gaps and be removed. A potential workaround would be to first trim the columns that have majority of positions in gaps, and then remove sequences with 50% overall gaps. Everything should be realigned afterwards.
Thank you for your response. Yes, I first did remove columns that had a majority position in gaps with trimAL. It's the later I am having an issue with. I am uncertain of trimAL parameters to remove undesirable sequences and was, therefore, wondering if there was a simple way to write a script to remove sequence with 50% overall gaps.
See if the script below works for you. The alignment has to be in FASTA format, though it can probably be modified easily to take other formats as well.
Gap fraction cutoff must be entered on a 0-1 scale.
import sys
from Bio import SeqIO
FastaFile = open(sys.argv[1], 'r')
FastaDroppedFile = open(sys.argv[2], 'w')
drop_cutoff = float(sys.argv[3])
if (drop_cutoff > 1) or (drop_cutoff < 0):
print('\n Sequence drop cutoff must be in 0-1 range !\n')
sys.exit(1)
for seqs in SeqIO.parse(FastaFile, 'fasta'):
name = seqs.id
seq = seqs.seq
seqLen = len(seqs)
gap_count = 0
for z in range(seqLen):
if seq[z]=='-':
gap_count += 1
if (gap_count/float(seqLen)) >= drop_cutoff:
print(' %s was removed.' % name)
else:
SeqIO.write(seqs, FastaDroppedFile, 'fasta')
FastaFile.close()
FastaDroppedFile.close()
Thank you very kindly for your response!! Very helpful indeed! I had some computer issues and am now re-installing biopython and other programs! Will test your code and report back once I have everything up and running! Greatly appreciate you taking the time!
If I wanted to do something similar but for AA ambiguous codes, how would I change this?
For example, I want to run this on thousands of alignments which all have 103 samples. But some samples are represented by >75% of "X"s rather than gaps.
This worked, thank you! Now that outputs the files that were dropped to the command window.
I have a for loop to do this for 8,000 protein alignments (right now pointing to a test folder with 4 alignments):
for f in `ls test/*fasta`
do
python drop_fasta.py ${f} test-2/`basename ${f}` 0.5 >> Output-test.txt
done
I want to be able to get some stats on the samples that were removed by using this output. I'd like to be able to get a summary output file that counts how many samples were removed from all the alignments, what their names are, and how many times each of those samples were removed from alignments. Then I could just do some simple math and get the total number of alignments each sample is present in.
do you have a suggestion how to do this last step?
You will need to modify a script on your own to do exactly what you want. It shouldn't be difficult as most of the work is already done. You also may want to redirect the screen output into a file.
Unless you have a reference sequence, this is not a simple problem. Let's say you have 9 sequences that are around 100 residues each, and one that is homolgous to all of them but has an inserted domain that brings its size to 200 residues. When you align them, those 9 sequences would likely have at least 50% gaps and be removed. A potential workaround would be to first trim the columns that have majority of positions in gaps, and then remove sequences with 50% overall gaps. Everything should be realigned afterwards.
Thank you for your response. Yes, I first did remove columns that had a majority position in gaps with trimAL. It's the later I am having an issue with. I am uncertain of trimAL parameters to remove undesirable sequences and was, therefore, wondering if there was a simple way to write a script to remove sequence with 50% overall gaps.