Is there a way or a program/module that I can use to compute the complexity score for the given sequences? I want to rank sequences by their complexity.
The software preseq was designed for just this purpose. From their website:
"The preseq package is aimed at predicting and estimating the complexity of a genomic sequencing library, equivalent to predicting and estimating the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing using an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples."
The publication, "Predicting the molecular complexity of sequencing libraries", appeared in Nature Methods.
Perhaps this Perl resource will be helpful to you: Algorithms to compute DNA complexity.
Compressibility is inversely related to the complexity of a sequence: if a sequence compresses very well (the compressed file is much smaller than the original), it is repetitive and not very complex. You can run a standard compression algorithm such as gzip over each sequence and compare the compressed sizes.
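As a rough sketch of the compression idea, here is one way to do it in Python with the standard-library zlib module (DEFLATE, the same algorithm gzip uses). The function names `compression_complexity` and `rank_by_complexity` are my own, not from any package, and the score is just compressed size divided by original size, so treat it as a heuristic rather than a formal complexity measure.

```python
import zlib


def compression_complexity(seq: str) -> float:
    """Return compressed size / original size for a sequence.

    Values near 0 mean the sequence is highly repetitive (low complexity);
    larger values mean it compresses poorly (higher complexity).
    """
    data = seq.upper().encode("ascii")
    if not data:
        return 0.0  # assumption: score an empty sequence as zero
    return len(zlib.compress(data, 9)) / len(data)


def rank_by_complexity(seqs):
    """Sort sequences from most to least complex by the ratio above."""
    return sorted(seqs, key=compression_complexity, reverse=True)


if __name__ == "__main__":
    examples = ["A" * 200, "ACGT" * 50, "ACGGTCATTGCA" * 16 + "GATTACA"]
    for s in rank_by_complexity(examples):
        print(f"{compression_complexity(s):.3f}  {s[:30]}...")
```

One caveat: zlib adds a few bytes of fixed overhead, so the ratio is not meaningful for very short sequences and is only comparable between sequences of similar length.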