I'd like to perform multiple alignments, where a gene from each sample was read in from fasta files. The fasta file represented one sample and had multiple genes. I have read in each sample fasta file and now have a dictionary of genes and their samples and sequences. Here is a small example.
The main key is the gene ID for another dictionary as the value. In the dictionary 'R??' is the sample name and key, followed by the annotation and sequence in a list.
{'PRELSG_MIT_v1:5894-6032': {'R47': ['not annotated',
Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
'R48': ['not annotated',
Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
'R49': ['not annotated',
Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
'R50': ['not annotated',
Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
'R51': ['not annotated',
Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
'R52': ['not annotated',
Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')]},
'exon_PRELSG_MIT00300.1-E1': {'R47': ['cytochrome b',
Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
'R48': ['cytochrome b',
Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
'R49': ['cytochrome b',
Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
'R50': ['cytochrome b',
Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
'R51': ['cytochrome b',
Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
'R52': ['cytochrome b',
Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')]}}
I'd like to align each gene and write out those alignments to file without any intermediate files written to disk.
How do I do that?
Is there a specific reason you stored everything as dicts rather than just using one of the SeqIO methods (or did you use
SeqIO.to_dict()
)?Generally speaking, other than for pairwise alignment there isn't really a benefit for aligning multiple sequences 'with' python, since they just call out to proper MSA tools like CLUSTAL and MUSCLE anyway, and you could just feed the files in via
AlignIO
.I used Seq.IO parse to read it into a dict, because that way I can store info about samples. I am also fetching data from a gff file because the fasta files only had coordinates for each gene. I want to do it in Python because I don't want to write intermediate files, keep it all in memory until I'm ready to write the alignments to file.
I don't think there is any multiple sequence alignment implementation in Biopython that won't use intermediate files (but will gladly be corrected). You can get around some of the hassle by making use of the
tempfile
module, but it will still create temporary files on disk.MSA can be very time and memory intensive, and python really isn't cut out for that type of computation.
I'd still use something like muscle or CLUSTAL to do the alignments. Thanks though.
The same applies - python will shell out to those applications behind the scenes and I'm pretty sure it will still create intermediate files that may need reading in again