Question

Shannon entropy score

3

8.6 years ago

curiousbiologist ▴ 40

Hi all,

I'm looking to calculate a Shannon entropy score for a short sequence corresponding to a hypervariable region; the idea is to compare this region across different samples. Does anyone have experience with that?

sequence sequencing • 6.0k views

ADD COMMENT • link updated 6.4 years ago by Biostar 20 • written 8.6 years ago by curiousbiologist ▴ 40

1

Entropy Of Dna Sequences

Calculating Shannon Entropy for DNA sequence?: http://math.stackexchange.com/questions/1405130/calculating-shannon-entropy-for-dna-sequence

ADD REPLY • link 8.6 years ago by Tonor ▴ 480

score 2 · Answer 1 · 2016-11-24

2

8.6 years ago

Joseph Hughes ★ 3.0k

There is an R package called entropy.

ADD COMMENT • link 8.6 years ago by Joseph Hughes ★ 3.0k

1

Another R package could be infotheo.

ADD REPLY • link 8.6 years ago by ddiez ★ 2.0k

score 2 · Answer 2 · 2018-02-13

The 'seqtk comp' command returns the #A, #C, #G, #T composition of each sequence.
With the following FASTA file:

>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA

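You can run this one-liner to get the Shannon entropy score for each sequence in your FASTA file. H is reset for every record so that the score of one sequence does not carry over into the next.

seqtk comp test.fa|awk '{H=0;for(i=3;i<=6;i++){if($i){H+=$i/$2*log($i/$2)/log(2)}}print $1,-H}'

which returns: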

seq1 0
seq2 1.84199
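
For anyone without seqtk at hand, here is a minimal Python sketch of the same per-sequence calculation. The function names `shannon_entropy` and `read_fasta` are made up for this example. One difference from the one-liner above: it counts every symbol in the sequence (so N would count as a fifth symbol), whereas the awk version only uses the A, C, G, T columns.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy in bits per symbol, from the symbol composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq.upper())
    # sum of p * log2(1/p) over all observed symbols
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

fasta = """>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA
"""
for name, seq in read_fasta(fasta.splitlines()):
    print(name, round(shannon_entropy(seq), 5))
```

For a real file, pass an open file handle to `read_fasta` instead of the inline string.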

score 1 · Answer 3 · 2016-11-24

1

8.6 years ago

Gabriel R. ★ 2.9k

Here is a C++ implementation:

https://github.com/grenaud/aLib/blob/0785cd32c32bd8b515b3a79daff4897833b0b63c/pipeline/filterReads.cpp

It hasn't been used/tested extensively but feel free to use the code.

ADD COMMENT • link 8.6 years ago by Gabriel R. ★ 2.9k

score 1 · Answer 4 · 2016-11-24

1

8.6 years ago

Brian Bushnell 20k

BBDuk calculates Shannon entropy, and can pass or fail sequences based on the score. For example:

bbduk.sh in=sequences.fa out=pass.fa outm=fail.fa entropy=0.9 entropywindow=50 entropyk=5

The code is in BBDukF.java in the function averageEntropy().
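
To give a feel for what a windowed, k-mer based entropy score looks like, here is a rough Python sketch. This is not BBDuk's code and will not reproduce its exact numbers: the function names are hypothetical, and the scaling to the 0-1 range (dividing by the largest entropy the window could reach) is an assumption made for this example. The `window` and `k` defaults simply mirror the `entropywindow=50 entropyk=5` values in the command above.

```python
import math
from collections import Counter

def kmer_entropy(window, k):
    """Entropy of the k-mer distribution in `window`, scaled to [0, 1] (assumed scaling)."""
    n = len(window) - k + 1          # number of k-mers in the window
    if n <= 1:
        return 0.0
    counts = Counter(window[i:i + k] for i in range(n))
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    # highest possible entropy: every k-mer distinct, capped by the 4^k DNA k-mers
    h_max = math.log2(min(n, 4 ** k))
    return h / h_max

def average_window_entropy(seq, window=50, k=5):
    """Mean scaled k-mer entropy over all sliding windows of `seq`."""
    seq = seq.upper()
    if len(seq) <= window:
        return kmer_entropy(seq, k)
    scores = [kmer_entropy(seq[i:i + window], k)
              for i in range(len(seq) - window + 1)]
    return sum(scores) / len(scores)

print(average_window_entropy("A" * 60))       # homopolymer: lowest score
print(average_window_entropy("ACGT" * 20))    # short repeat: still low
```

A sequence would then pass or fail by comparing this average against a threshold, which is the idea behind a cutoff such as `entropy=0.9`.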

ADD COMMENT • link 8.6 years ago by Brian Bushnell 20k

score 1 · Answer 5 · 2018-02-13

Give BioJava a try:

import java.util.*;

import org.biojava.bio.dist.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class Entropy {
    public static void main(String[] args) {

        Distribution dist = null;
        try {
            //create a biased distribution
            dist =
                DistributionFactory.DEFAULT.createDistribution(DNATools.getDNA());

            //set the weight of a to 0.97
            dist.setWeight(DNATools.a(), 0.97);

            //set the others to 0.01
            dist.setWeight(DNATools.c(), 0.01);
            dist.setWeight(DNATools.g(), 0.01);
            dist.setWeight(DNATools.t(), 0.01);
        }
        catch (Exception ex) {
            ex.printStackTrace();
            System.exit(-1);
        }

        //calculate the information content
        double info = DistributionTools.bitsOfInformation(dist);
        System.out.println("information = "+info+" bits");
        System.out.print("\n");

        //calculate the Entropy (using the conventional log base of 2)
        HashMap entropy = DistributionTools.shannonEntropy(dist, 2.0);

        //print the Entropy of each residue
        System.out.println("Symbol\tEntropy");
        for (Iterator i = entropy.keySet().iterator(); i.hasNext(); ) {
            Symbol sym = (Symbol)i.next();
            System.out.println(sym.getName()+ "\t" +entropy.get(sym));
        }
    }
}

score 0 · Answer 6 · 2016-11-24

I found this on the net. The next step would be to adapt it for NGS use.

http://code.activestate.com/recipes/577476-shannon-entropy-calculation/

# Shannon Entropy of a string
# = minimum average number of bits per symbol
# required for encoding the string
#
# So the theoretical limit for data compression:
# Shannon Entropy of the string * string length
# FB - 201011291
import math

st = 'acgtaggatcccctat' # input string
# st = '00010101011110' # Shannon entropy for this string would be 1 bit/symbol

print('Input string:')
print(st)
print()
stList = list(st)
alphabet = list(set(stList)) # list of distinct symbols in the string
print('Alphabet of symbols in the string:')
print(alphabet)
print()
# calculate the frequency of each symbol in the string
freqList = []
for symbol in alphabet:
    ctr = 0
    for sym in stList:
        if sym == symbol:
            ctr += 1
    freqList.append(float(ctr) / len(stList))
print('Frequencies of alphabet symbols:')
print(freqList)
print()
# Shannon entropy: -sum(p * log2(p)) over the symbol frequencies
ent = 0.0
for freq in freqList:
    ent = ent + freq * math.log(freq, 2)
ent = -ent
print('Shannon entropy:')
print(ent)
print('Minimum number of bits required to encode each symbol:')
print(int(math.ceil(ent)))
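
As one possible direction for the NGS step, here is a small sketch that computes the entropy per position across a set of reads covering the same region, which may be closer to what the original question (a hypervariable region compared between samples) needs. It assumes the reads are already aligned and trimmed to the same length; the function name `column_entropies` and the toy reads are invented for illustration. Gap characters, if present, are simply counted as another symbol.

```python
import math
from collections import Counter

def column_entropies(aligned_seqs):
    """Per-position Shannon entropy (bits) across equal-length aligned sequences."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    entropies = []
    # zip(*seqs) walks the alignment column by column
    for column in zip(*aligned_seqs):
        counts = Counter(base.upper() for base in column)
        n = len(column)
        entropies.append(sum((c / n) * math.log2(n / c) for c in counts.values()))
    return entropies

# toy example: positions 1-2 are conserved, positions 3-4 vary
reads = ["ACGT", "ACGA", "ACTT", "ACCC"]
print(column_entropies(reads))
```

Running this once per sample gives one entropy profile per sample, and those profiles can then be compared position by position.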

score 0 · Answer 7 · 2016-11-24

0

8.6 years ago

ahmedakhokhar ▴ 150

Please see the publication http://bioinformatics.oxfordjournals.org/content/23/15/1875.full.pdf