Question

Complex Genomic Regions

7

13.1 years ago

learnerforever ▴ 520

Is there a wiki or white paper that explains what exactly the "complex regions" of the genome are? I have come across this term several times in sequencing studies, in phrases such as "regions with high/low complexity" or "difficult to sequence", and just could not figure out what it means.

Thanks!

sequence • 4.4k views

updated 13.1 years ago by Casey Bergman 18k • written 13.1 years ago by learnerforever ▴ 520

score 7 · Answer 1 · 2011-10-13

DNA sequence complexity in modern terms has changed from its original meaning, which is part of the reason you are having trouble finding a clear definition.

The original usage of complexity relates to studies using reassociation kinetics to measure how repetitive/unique a genome was, through Cot curve analysis. In short, these studies measured how fast denatured DNA reassociated, with faster reassociation implying increased repetitiveness. The "complexity" of a genome was then measured by the Cot value (initial DNA concentration multiplied by time) at which half of the DNA had reassociated.

The more modern usage takes this original term and tries to apply it (loosely) to actual DNA sequences, with the same notion that more complex sequences have a higher degree of uniqueness, and vice versa. I am not sure that there is a widely accepted definition of this modern usage applied to the "complexity" of DNA sequences. My suspicion is that it is operationally defined with respect to the algorithm used (simple sequence repeat detection, compression, etc.). A little googling found this definition, which seems reasonable (but no reference is provided):

"The complexity of a sequence is defined as the longest non-repetitive sequence that can be derived from a sequence", e.g.

sequence      complexity
TTTTTTTTTT    1
TATATATATA    2
TACTACTAC     3
TACGTACG      4
TACGGTACGG    5
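
One way to read the quoted definition (an assumption, since the quote comes without a reference) is as the length of the shortest unit that regenerates the whole sequence when repeated. A minimal Python sketch under that reading follows; the function name `repeat_unit_length` is made up for illustration and is not taken from any tool.

```python
def repeat_unit_length(seq: str) -> int:
    """Length of the shortest unit whose repetition yields seq.

    Assumed interpretation of "complexity" from the quoted definition:
    the smallest period p such that seq[i] == seq[i - p] for all i >= p.
    A trailing partial copy of the unit is allowed.
    """
    n = len(seq)
    for p in range(1, n + 1):
        # p is a valid period if every base matches the base p positions earlier
        if all(seq[i] == seq[i - p] for i in range(p, n)):
            return p
    return n  # only reached for the empty string, where n == 0


if __name__ == "__main__":
    for s in ["TTTTTTTTTT", "TATATATATA", "TACTACTAC", "TACGTACG", "TACGGTACGG"]:
        print(f"{s:<12} {repeat_unit_length(s)}")
```

On the five sequences in the table this gives 1, 2, 3, 4 and 5. A sequence with no internal repeat simply gets its own length, so the measure only makes sense when comparing sequences of similar length.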

HTH, Casey

score 3 · Answer 2 · 2011-10-12

There is probably a lot more to it. But what I remember from Sanger sequencing is:

Areas that are not complex at all but very long (e.g. 800 G's) are hard, since the exact length cannot be determined once the stretch is longer than a read.
Areas that share a lot of sequence with other areas are hard as well, since you don't know which read comes from which area.
Since Sanger sequencing uses clones to amplify the sequence, the insert must actually allow the bacteria to grow. Sometimes that is not the case, because the inserted DNA produces proteins that are toxic to the bacteria used. Regions of this last kind are simply hard to sequence, even though they would probably be rather complex.
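
The second point, reads from shared sequence being impossible to place, can be illustrated with a toy example. This is only a sketch with exact string matching on a made-up reference; real aligners tolerate mismatches and work very differently. The function name `match_positions` and the sequences are hypothetical.

```python
def match_positions(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where read occurs exactly in reference."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # resume one base later so overlapping occurrences are found too
        start = reference.find(read, start + 1)
    return positions


if __name__ == "__main__":
    # Made-up reference containing the segment GATTACA twice
    reference = "ACGTTGCA" + "GATTACA" + "CCCGGG" + "GATTACA" + "TTAACC"
    print(match_positions(reference, "GATTACA"))  # two candidate origins
    print(match_positions(reference, "CCCGGG"))   # one unambiguous origin
```

A read that returns more than one position could have come from either copy, which is exactly the ambiguity described above.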

score 3 · Answer 3 · 2011-10-13

13.1 years ago

Pierre Lindenbaum 164k

As far as I remember, NCBI BLAST uses a measure based on Shannon entropy to determine the complexity of a sequence.
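
For reference, Shannon entropy of a sequence is H = sum over symbols of p * log2(1/p), where p is the frequency of each symbol. The sketch below computes it over single bases and over sliding windows. It is a generic illustration of the measure, not a reimplementation of whatever filter BLAST actually applies; the function names and the window parameters are assumptions chosen for the example.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol of the base composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    # p = count / n for each distinct symbol; log2(1/p) == log2(n / count)
    return sum((c / n) * log2(n / c) for c in Counter(seq).values())


def window_entropies(seq: str, window: int = 10, step: int = 1) -> list[float]:
    """Entropy of each full window of the given size, moving by step bases."""
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    for s in ["TTTTTTTTTT", "TATATATATA", "ACGT"]:
        print(f"{s:<12} {shannon_entropy(s):.3f}")
    print(window_entropies("TTTTACGT", window=4, step=4))
```

A homopolymer scores 0 bits and a sequence with all four bases in equal proportion scores 2 bits, the maximum for DNA. Note that composition alone ignores order: TATATATATA and a shuffled sequence of five T's and five A's get the same value, so windows with low entropy are low complexity, but a high value does not rule out repeats.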
