Definition of palindrome sequence: In the palindrome structure of double-stranded DNA, there is (or no) palindrome structure in the central region, and the two side chains have a similar (similar degree>99.5%) base sequence in the 5'to 3'direction.
5'-CATCAGTTACAAT[****]ATTGTAACTGATG-3'
5'-GTAGTCAATGTTA[****]TAACATTGACTAC-3'
There are two test files: test1.fa and test2.fa. They are FASTA text format files with multiple 100bp DNA sequences. The format is as follows:
>seq1
ACTGATGTAG
Now I want to assemble long sequences using the short 100bp DNA sequences from test1.fa and test2.fa. Then, I'd like to find out the palindrome sequence from all these long sequences. Here, we believe that the sequence similarity on both sides of the central region is >99.5% can be regarded as a palindrome sequence.
Possible strategy: First, find out the palindrome short sequence in test file. Then use other short sequences to extend the chains on both sides of the palindrome sequence, and count the types of palindrome sequences that are finally assembled
How can I use Python to deal with this question?