How does the Burrows-Wheeler Transform work on genetic sequences?
6.7 years ago

Hi -

I'm wondering how exactly the BWT ends up creating "runs" of the same character in the last column after sorting the cyclic rotations of the string. Is it because some nucleotides are more likely to occur next to certain others, in a way I'm not aware of? I understand that in English a character is often more likely to appear after a particular other character (that's how the BWT's effectiveness is explained in every text I've found so far), but does this also apply to nucleotides?

Burrows-Wheeler Transform Sequences Compression • 1.7k views
6.7 years ago
kloetzl ★ 1.1k

You are right that the BWT will have a hard time compressing random sequences of nucleotides, since (uniform) random data is by definition hard to compress. However, genetic sequences are far from random. Just think of GC bias, codon bias, motifs and patterns in various forms (promoters, the TATA box), duplications, and so on. All of these reduce the "randomness" (entropy) of the data and increase its repetitiveness, which the BWT can then exploit.
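
To make this concrete, here is a minimal sketch that builds the BWT the naive way (sort all rotations, take the last column) and counts the runs in the result. The function names `bwt` and `count_runs`, the `$` sentinel, and the toy sequences are my own choices for illustration; real tools build the BWT from a suffix array rather than sorting full rotations, and a perfectly periodic string is far more repetitive than any real genome.

```python
import random


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel, return the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel  # '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of identical characters in s."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the comparison is repeatable
    repetitive = "ACGTTGCA" * 50  # toy stand-in for repeated motifs
    shuffled = "".join(rng.choice("ACGT") for _ in range(len(repetitive)))

    for name, seq in [("repetitive", repetitive), ("random", shuffled)]:
        print(name, "runs in input:", count_runs(seq),
              "runs in BWT:", count_runs(bwt(seq)))
```

Rotations that begin with the same context end up next to each other after sorting, so whenever a context tends to be preceded by the same nucleotide, that nucleotide piles up in the last column. In the repetitive sequence this collapses the BWT into a handful of long runs, while for the uniform random sequence sorting gains essentially nothing.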


Also, a 4-symbol alphabet helps a lot in keeping data structures small.
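
One way to read this point: with only four symbols, each base fits in 2 bits, so four bases pack into one byte. The sketch below shows only that packing; the names `pack` and `unpack` and the A=0, C=1, G=2, T=3 coding are assumptions for illustration, and it ignores the sentinel and ambiguity codes such as N that a real index would have to handle.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit coding
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over A/C/G/T into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


if __name__ == "__main__":
    seq = "GATTACA"
    packed = pack(seq)
    print(len(seq), "bases stored in", len(packed), "bytes")
    print(unpack(packed, len(seq)))
```

The length has to be stored separately because the final byte may be only partly used.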
