Parts of sequences are given below-
Reference sequence (pre-alignment):
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGT
Reference sequence (post-alignment) and below it is Sample sequence (post-alignment):
--------------------------------------------------------------------------------------------------attaaa---------ggtt------------------tataccttc---------ccaggtaacaaa-------------ccaacc-----aactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgt
--------------------------------------------------------------------------------------------------------------------------------------------------------------taacaaa-------------ccaacc-----aactttcgatctcttgtagatctgttctctaaacgaactttaaaatctgtgt
I'm adding a simpler to interpret example as per a comment on this post.
Say the ref seq is aattaaatttgggggtttt
and the sample seq is ttaaggggttaaatttgggggt--t
. Then post-alignment, they will be like-
--aa----ttaaatttgggggtttt
ttaaggggttaaatttgggggt--t
Now, since my ref seq was
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
a a t t a a a t t t g g g g g t t t t
I want that post-alignment also, the indexing should be conserved-
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
- - a a - - - - t t a a a t t t g g g g g t t t t
t t a a g g g g t t a a a t t t g g g g g t - - t
I want to keep the indexing of the reference sequence conserved(i.e. the first base in ref seq post-alignment is a
, second is t
, third is t
, etc.), like they do in standard softwares, and then I want to run some quick analysis on it, say to check for conservation of a (mono/di)-nucleotide at some positions. If anybody has some insight on how to do it the most efficient way(memory-wise and time-wise), then that'd be great. I use Python for my work.
GenoMax As per your comment, I've updated the question. Please let me know if you still need more details from me.
This looks good. In what way do you want the indexing to be conserved? How would that be represented e.g. in an array? Are you doing this alignment yourself in python code?
I'm using
MAFFT
aligner for MSA. I wanted to know the tradeoffs of using different types of data structures for this kind of work given I've more than a million sequences and 8 GB Ram. It is either writing a code smartly or just breaking down the alignment output file into 100s of parts and then running the script on that. Then concatenating the output of the python script again(which I don't want to do).