Question

Finding palindrome sequences after assembling sequences (Python)

4.9 years ago

anran04100 • 0

Definition of palindrome sequence: in the palindrome structure of double-stranded DNA, the central region may or may not itself be palindromic, and the two flanking arms have similar base sequences (similarity > 99.5%) when each is read in the 5' to 3' direction on its own strand.

5'-CATCAGTTACAAT[****]ATTGTAACTGATG-3'
3'-GTAGTCAATGTTA[****]TAACATTGACTAC-5'
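
To make the definition concrete, here is a minimal sketch that checks whether the right arm of the example is the reverse complement of the left arm. The function names `reverse_complement` and `arm_identity` are made up for this illustration, and the sketch assumes the arms contain only A, C, G and T.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def arm_identity(left_arm, right_arm):
    """Fraction of positions where right_arm equals the reverse complement of left_arm."""
    if not left_arm or len(left_arm) != len(right_arm):
        raise ValueError("arms must be non-empty and of equal length")
    expected = reverse_complement(left_arm).upper()
    matches = sum(a == b for a, b in zip(expected, right_arm.upper()))
    return matches / len(left_arm)


# The two arms from the example above, both read 5' to 3' on the top strand.
left, right = "CATCAGTTACAAT", "ATTGTAACTGATG"
print(reverse_complement(left))   # ATTGTAACTGATG
print(arm_identity(left, right))  # 1.0
```

An identity of 1.0 means the arms form a perfect palindrome around the central region; anything lower indicates mismatches between the arms.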

There are two test files: test1.fa and test2.fa. They are FASTA text format files with multiple 100bp DNA sequences. The format is as follows:

>seq1

ACTGATGTAG
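
For reference, a file in this format can be read with a few lines of plain Python. This is only a sketch: `read_fasta` is a hypothetical helper name, and it assumes each record starts with a `>` header line and that blank lines can be ignored.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file, skipping blank lines."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # blank lines between header and sequence are ignored
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].strip(), []
            elif name is not None:
                chunks.append(line.upper())
    # Emit the last record, if any.
    if name is not None:
        yield name, "".join(chunks)


# Example usage (assumes test1.fa exists in the working directory):
# reads = dict(read_fasta("test1.fa"))
```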

Now I want to assemble long sequences from the short 100 bp DNA sequences in test1.fa and test2.fa, and then find the palindrome sequences among all of these long sequences. Here, a sequence counts as a palindrome if the similarity between the two sides of the central region is >99.5%.

Possible strategy: first, find the short palindrome sequences in the test files. Then use the other short sequences to extend the chains on both sides of each palindrome sequence, and count the types of palindrome sequences that are finally assembled.

How can I use Python to solve this problem?

next-gen assembly sequence • 1.1k views

updated 4.9 years ago by Mensur Dlakic ★ 29k • written 4.9 years ago by anran04100 • 0

score 1 · Accepted Answer · 2020-09-01

What you describe as a palindrome is actually a pseudo-palindrome. Also, for short palindromes like the ones you are showing, in practical terms there is no such thing as a similarity of >99.5%. One mismatch in a 20-bp sequence means that the identity drops to 95%, so you would need a palindrome that is >200 bp long before >99.5% identity would even come into consideration. If you want the identity to be >99.5%, for practical purposes you might as well go with 100%, because it won't make any difference.
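
The arithmetic behind this can be checked with a short sketch. The helper names `identity_with_mismatches` and `min_length_for_identity` are invented for this illustration; exact fractions are used so that the boundary case at 200 bp is not blurred by floating-point rounding.

```python
from fractions import Fraction


def identity_with_mismatches(length, mismatches=1):
    """Identity of an alignment of `length` bases containing `mismatches` mismatches."""
    return (length - mismatches) / length


def min_length_for_identity(threshold="0.995", mismatches=1):
    """Smallest length at which `mismatches` mismatches still give identity > threshold."""
    t = Fraction(threshold)
    # (L - m) / L > t  is equivalent to  L > m / (1 - t)
    bound = Fraction(mismatches) / (1 - t)
    # Smallest integer strictly greater than the bound.
    return int(bound) + 1


print(identity_with_mismatches(20))  # 0.95
print(min_length_for_identity())     # 201
```

At exactly 200 bp a single mismatch gives 99.5%, which is not strictly greater than the threshold, so the first qualifying length is 201 bp.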

I don't have a good feel for what you are trying to do, so my suggestions may be off. But from what I understand, I would suggest doing your search on the assembled sequences.

Here is a short Python code, portions of which may be helpful. Beyond that, you can use Biopython to read the sequences, and then write your own routines to check for pseudo-palindromes. Beware of two things: 1) this will become very complex unless you define the length of the middle region to be a relatively small number; 2) if you are using this search to find protein-binding sites, the middle region should be allowed to have both odd and even numbers of bases.

If you don't do (1), this will become very complex and time-consuming, and you may end up with meaningless results. Unless you are looking for inverted repeats, it is probably useless to find a palindrome separated by 15 kb of sequence. You will find a bunch of such examples if you don't limit the size of the middle region.
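
As a rough illustration of a search with a bounded middle region, the sketch below scans one assembled sequence for two arms of fixed length separated by at most `max_spacer` bases, where the spacer may be odd or even. Everything here is an assumption made for the example: the function name `find_pseudo_palindromes`, the parameters `arm_len`, `max_spacer` and `min_identity`, and the restriction to sequences containing only A, C, G and T. It is a naive scan, not an optimized routine.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pseudo_palindromes(seq, arm_len=13, max_spacer=10, min_identity=1.0):
    """Return (start, spacer_len, identity) for every pair of arms that qualifies.

    start is the 0-based position of the left arm; the right arm begins at
    start + arm_len + spacer_len. Spacer lengths 0..max_spacer are all tried.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_len + 1):
        left_rc = reverse_complement(seq[start:start + arm_len])
        for spacer in range(max_spacer + 1):
            right_start = start + arm_len + spacer
            right = seq[right_start:right_start + arm_len]
            if len(right) < arm_len:
                break  # right arm would run off the end of the sequence
            matches = sum(a == b for a, b in zip(left_rc, right))
            identity = matches / arm_len
            if identity >= min_identity:
                hits.append((start, spacer, identity))
    return hits


# The arms from the question, separated by a 4-base spacer and padded on both sides.
example = "GG" + "CATCAGTTACAAT" + "ACGA" + "ATTGTAACTGATG" + "CC"
hits = find_pseudo_palindromes(example)
print((2, 4, 1.0) in hits)  # True
```

The running time grows with sequence length times `max_spacer` times `arm_len`, which is one reason to keep the middle region small.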