Use biopython to align SeqRecords stored in dict
1
0
Entering edit mode
3.1 years ago
Glubbdrubb • 0

I'd like to perform multiple alignments, where a gene from each sample was read in from fasta files. The fasta file represented one sample and had multiple genes. I have read in each sample fasta file and now have a dictionary of genes and their samples and sequences. Here is a small example.

The main key is the gene ID for another dictionary as the value. In the dictionary 'R??' is the sample name and key, followed by the annotation and sequence in a list.

{'PRELSG_MIT_v1:5894-6032': {'R47': ['not annotated',
                                     Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
                             'R48': ['not annotated',
                                     Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
                             'R49': ['not annotated',
                                     Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
                             'R50': ['not annotated',
                                     Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
                             'R51': ['not annotated',
                                     Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')],
                             'R52': ['not annotated',
                                     Seq('ATAAAAATAGTTACCATAGCTGTTGATGGATGCTTCGATCATAAAATATGTTTG...GGA')]},
 'exon_PRELSG_MIT00300.1-E1': {'R47': ['cytochrome b',
                                       Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
                               'R48': ['cytochrome b',
                                       Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
                               'R49': ['cytochrome b',
                                       Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
                               'R50': ['cytochrome b',
                                       Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
                               'R51': ['cytochrome b',
                                       Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')],
                               'R52': ['cytochrome b',
                                       Seq('TCCAGATAATGCTATTACAGTAGATAGATATGCTACTCCTTTACATATTGTTCC...TTC')]}}

I'd like to align each gene and write out those alignments to file without any intermediate files written to disk.
How do I do that?

Biopython multiple-alignment • 1.6k views
ADD COMMENT
0
Entering edit mode

Is there a specific reason you stored everything as dicts rather than just using one of the SeqIO methods (or did you use SeqIO.to_dict())?

Generally speaking, other than for pairwise alignment there isn't really a benefit for aligning multiple sequences 'with' python, since they just call out to proper MSA tools like CLUSTAL and MUSCLE anyway, and you could just feed the files in via AlignIO.

ADD REPLY
0
Entering edit mode

I used Seq.IO parse to read it into a dict, because that way I can store info about samples. I am also fetching data from a gff file because the fasta files only had coordinates for each gene. I want to do it in Python because I don't want to write intermediate files, keep it all in memory until I'm ready to write the alignments to file.

ADD REPLY
0
Entering edit mode

I don't think there is any multiple sequence alignment implementation in Biopython that won't use intermediate files (but will gladly be corrected). You can get around some of the hassle by making use of the tempfile module, but it will still create temporary files on disk.

MSA can be very time and memory intensive, and python really isn't cut out for that type of computation.

ADD REPLY
0
Entering edit mode

I'd still use something like muscle or CLUSTAL to do the alignments. Thanks though.

ADD REPLY
0
Entering edit mode

The same applies - python will shell out to those applications behind the scenes and I'm pretty sure it will still create intermediate files that may need reading in again

ADD REPLY
1
Entering edit mode
3.1 years ago

The requirement of "no intermediate files" feels superfluous and leads to unneeded complications. There is no reason to eagerly avoid that.

As a matter of fact, it is desirable to have intermediate files to be able to investigate the results, often things do go in unexpected ways.

In this case, the problem is simple, save your file as multi-fasta, run clustalw then parse the resulting file. Use BioPython as the tool that glues the various information together. That is what BioPython was designed for.

That is the normal, standard operation protocol.

ADD COMMENT

Login before adding your answer.

Traffic: 1480 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6