Complex Genomic Regions
13.2 years ago

Is there a wiki or white paper that explains what exactly the "complex regions" of the genome are? I have come across this term several times in sequencing studies, in phrases such as "regions with high/low complexity" and "difficult to sequence", and could not figure out what it means.

Thanks!

13.2 years ago

DNA sequence complexity in modern terms has changed from its original meaning, which is part of the reason you are having trouble finding a clear definition.

The original usage of complexity comes from studies that used reassociation kinetics (Cot curve analysis) to measure how repetitive or unique a genome was. In short, these studies measured how fast denatured DNA reassociated, with faster reassociation implying increased repetitiveness. The "complexity" of a genome was then measured by the Cot value (initial DNA concentration multiplied by time) at which half of the DNA had reassociated.
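
For reference, the half-reassociation point can be written down using the standard second-order model of reassociation. This is only a sketch; the symbols C (concentration of DNA still single-stranded at time t), C_0 (initial concentration) and k (rate constant) are introduced here for illustration:

```latex
\frac{C}{C_0} = \frac{1}{1 + k\,C_0\,t}
\qquad\Longrightarrow\qquad
\frac{C}{C_0} = \frac{1}{2}
\;\text{ when }\;
C_0\,t_{1/2} = \frac{1}{k}
```

A highly repetitive genome has a large k and therefore a small C_0 t_{1/2}; a genome made mostly of unique sequence reassociates slowly and has a large C_0 t_{1/2}.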

The more modern usage takes this original term and tries to apply it (loosely) to actual DNA sequences, with the same notion that more complex sequences have a higher degree of uniqueness, and vice versa. I am not sure that there is a widely accepted definition of this modern usage applied to the "complexity" of DNA sequences. My suspicion is that it is operationally defined with respect to the algorithm used (simple sequence repeat detection, compression, etc.). A little googling turned up this definition, which seems reasonable (but no reference is provided):

"The complexity of a sequence is defined as the longest non-repetitive sequence that can be derived from a sequence", e.g.

sequence      complexity
TTTTTTTTTT    1
TATATATATA    2
TACTACTAC     3
TACGTACG      4
TACGGTACGG    5
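
One way to reproduce the numbers in this table is to read "complexity" as the length of the shortest unit whose repetition rebuilds the sequence. That reading is my assumption, since the quoted definition comes without a reference, and the function name `repeat_unit_length` is made up for this sketch:

```python
def repeat_unit_length(seq: str) -> int:
    """Length of the shortest unit whose repetition reproduces seq.

    Assumption: the final copy of the unit may be truncated, so a
    sequence only has to be periodic, not a whole number of repeats.
    """
    n = len(seq)
    for period in range(1, n + 1):
        # seq has this period if every base equals the base `period` positions earlier
        if all(seq[i] == seq[i - period] for i in range(period, n)):
            return period
    return 0  # only reached for the empty string


for s in ["TTTTTTTTTT", "TATATATATA", "TACTACTAC", "TACGTACG", "TACGGTACGG"]:
    print(s, repeat_unit_length(s))
```

Under this reading a sequence with no internal periodicity simply scores its own length, so the measure is only meaningful when comparing sequences of similar size.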

HTH, Casey

13.2 years ago

There is probably a lot more to it. But what I remember from Sanger sequencing is:

  • Areas that are not complex at all but very long (e.g. 800 G's) are hard, because once the run is longer than a read it is difficult to determine its exact length.
  • Areas that share a lot of sequence with other areas are hard as well, since you don't know which read comes from which area.
  • Since Sanger sequencing uses clones to amplify the sequence, the insert must actually allow the bacteria to grow. Sometimes that is not the case, because the inserted DNA produces proteins that are toxic to the bacteria used. These last regions would of course just be hard to sequence, even though they are probably rather complex.
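
The first point can be illustrated with a small sketch that flags single-base runs longer than a read. The function name `long_homopolymers` and the `read_length` parameter are hypothetical, chosen only for this example:

```python
from itertools import groupby


def long_homopolymers(seq: str, read_length: int):
    """Return (start, base, length) for every single-base run longer than a read."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)  # size of this run of identical bases
        if length > read_length:
            runs.append((pos, base, length))
        pos += length
    return runs


example = "ACGT" + "G" * 10 + "TTA"
print(long_homopolymers(example, 8))
```

Any run reported here cannot be spanned by a single read, so its exact length cannot be determined from that read alone.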
13.2 years ago

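As far as I remember, NCBI BLAST masks low-complexity regions using a measure related to Shannon entropy (entropy is a measure rather than an algorithm; the filters themselves are called SEG for proteins and DUST for nucleotides).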
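
To give a feel for the measure, here is a minimal sketch of per-base Shannon entropy over a sequence and over sliding windows. This is not BLAST's actual filter; the function names and the `window` parameter are assumptions made for illustration:

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per symbol, from the symbol frequencies in seq."""
    counts = Counter(seq)
    n = len(seq)
    # p * log2(1/p) summed over symbols, with p = c / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def windowed_entropy(seq: str, window: int):
    """Entropy of every window of the given size, sliding one base at a time."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]


print(shannon_entropy("TTTTTTTT"))   # single symbol: minimum entropy
print(shannon_entropy("ACGTACGT"))   # four equally frequent bases: maximum for DNA
print(windowed_entropy("AAAAACGT", 4))
```

Windows with entropy near zero are the low-complexity candidates. Note that composition-based entropy ignores order, so a repeat like TATATATA still scores 1 bit per base.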
