I've been racking my brains about the best way to do this for a while now but am no closer to the answer it seems, so I could use some help!
I'll explain the problem conceptually, then in context as that may help. Essentially the problem is this:
Given a pair of sequences, at least one of which has per-letter annotations
in biopython
, I need to align the shorter of the 2 sequences (since one is likely to have gaps, where as the annotated sequence will be 'full'), and be able to 'carry' the per-letter annotations with it.
For example, if I have these 2 dummy sequences (with some scores immediately below the longer sequence):
>1
ATGCAT
135198
>2
ATCAT
In practice these scores are not all ints
and will be floats
, but for the purpose of demonstration random numbers will suffice.
Now if these sequences were aligned, and the scores were carried/transferred to the other sequence we'd get:
ATGCAT
AT-CAT
13-198
So the question is, how can I align these in general, and preserve these annotations? So far, I've been looking to see if pairwise2
can handle SeqRecords and deal with moving those characters around as the object would preserve the character-score relationship, but it seems to only deal with literal sequence strings. In the above case, it would be pretty trivial to copy the annotations over to the new sequence and then just 'drop' any position where there is a gap.
In more complicated real-world examples though, I'm expecting that this logic may not always hold, but I am still trying to kind of tease this thread through in my head to understand whether I'm overcomplicating things.
I said I'd offer the real world motivation for this too, so the reason I need to do this is that I have a number of fasta files which have been analysed for their immunogenicity properties, and I am in the process of 'rendering' this on to the 3D structure. Its common for 3D structures to be missing stretches of residues though, so in order to correctly 'associate' the sequence with its structure, and the right residues together, I need to be able to align them and (as far as possible) unambiguously identify which residue in the structure corresponds to the value in the fasta/scores file.
Thoughts/feelings/code gratefully recieved.
Re-posted it at SO with R/Bioconductor tags: https://stackoverflow.com/q/59682107/680068
I don't think there is yet an elegant way to do this in Biopython. In theory the
MutableSeq
class would be well suited for this task, since it handles sequences that are not single characters, but in practise it's not currently maintained in a way that could be used. You can at least use the newerBio.Align
module rather thanpairwise2
Yeah I did take a look at the
PairwiseAligner()
but it doesn't seem to offer any additional functionality.I'm actually a bit surprised that the alignment modules don't ingest SeqRecords since that's pretty much the primary currency of biopython.
Shame. Perhaps there's a feature request in the making if its confirmed there's definitely no elegant way of doing it. I thought there might be since when it comes to slicing SeqRecords, it's quite intelligent in how it handles features, annotations and per-letter annotations.
The
PairwiseAligner()
offers some different alignment algorithms and a fast C-implementation, but yes it does not solve your issues. The alignment module will 'ingest'Seq
/MutableSeq
/UnknownSeq
objects (not just str) and I'm not sure it would make sense for them to ingestSeqRecord
s since alignment is purely based on the sequenceWhat is the input format, since it holds floats, it can't be the modified FASTA format you posted? If there is a standard input format, then this could possibly be an enhancement for Biopython to read via SeqIO, then it could create a SeqRecord with letter annotations as you suggested. (However, you'll still have to do some work to parse the resulting alignment and map the score annotations back)
At the moment the input format is entirely undefined and can be whatever I want it to be. The fasta above was just a clear example
One could even encode it as a sort of fastq style format, however the real values are floats as I mentioned so they cannot be encoded in a single character easily (unlike, say fastq scores are).
Biopython's per letter annotations can simply be set as
SeqRecord.letter_annotations["Name"] = [...]
so the data format can be pretty flexible as all you need is a list.