shannon entropy score
7
3
Entering edit mode
8.0 years ago

Hi all,

I'm looking to determinate shannon entropy score for a short sequence corresponding for an hyper-variable region, the idea is to compare this region for different samples. Any experience with that?

sequence sequencing • 5.4k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
2
Entering edit mode
8.0 years ago
Joseph Hughes ★ 3.0k

There is an R package called entropy.

ADD COMMENT
1
Entering edit mode

Another R package could be infotheo.

ADD REPLY
2
Entering edit mode
6.8 years ago
sacha ★ 2.4k

'seqtk comp' command return #A,#C,#G,#T composition.
With the following fasta file :

>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA

You can run this oneliner to get Shannon entropy score for each sequence in your fasta.

seqtk comp test.fa|awk '{for(i=3;i<=6;i++){if($i){H+=$i/$2*log($i/$2)/log(2)}}print $1,-H}'

which return :

seq1 0
seq2 1.84199
ADD COMMENT
1
Entering edit mode
8.0 years ago
Gabriel R. ★ 2.9k

Here is a C++ implementation:

https://github.com/grenaud/aLib/blob/0785cd32c32bd8b515b3a79daff4897833b0b63c/pipeline/filterReads.cpp

It hasn't been used/tested extensively but feel free to use the code.

ADD COMMENT
1
Entering edit mode
8.0 years ago

BBDuk calculates Shannon entropy, and can pass or fail sequences based on the score. For example:

bbduk.sh in=sequences.fa out=pass.fa outm=fail.fa entropy=0.9 entropywindow=50 entropyk=5

The code is in BBDukF.java in the function averageEntropy().

ADD COMMENT
1
Entering edit mode
6.8 years ago
vmicrobio ▴ 290

Give a try to biojava:

import java.util.*;

import org.biojava.bio.dist.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class Entropy {
   public static void main(String[] args) {

      Distribution dist = null;
      try {
      //create a biased distribution
          dist =
               DistributionFactory.DEFAULT.createDistribution(DNATools.getDNA());

      //set the weight of a to 0.97
      dist.setWeight(DNATools.a(), 0.97);

      //set the others to 0.01
      dist.setWeight(DNATools.c(), 0.01);
      dist.setWeight(DNATools.g(), 0.01);
      dist.setWeight(DNATools.t(), 0.01);
   }
   catch (Exception ex) {
   ex.printStackTrace();
   System.exit(-1);
}

    //calculate the information content
    double info = DistributionTools.bitsOfInformation(dist);
    System.out.println("information = "+info+" bits");
    System.out.print("\n");

    //calculate the Entropy (using the conventional log base of 2)
    HashMap entropy = DistributionTools.shannonEntropy(dist, 2.0);

    //print the Entropy of each residue
    System.out.println("Symbol\tEntropy");
    for (Iterator i = entropy.keySet().iterator(); i.hasNext(); ) {
      Symbol sym = (Symbol)i.next();
      System.out.println(sym.getName()+ "\t" +entropy.get(sym));
    }
  }
}
ADD COMMENT
0
Entering edit mode
8.0 years ago

If found this on the net. Next step would be to implement it for a NGS use

http://code.activestate.com/recipes/577476-shannon-entropy-calculation/

# Shannon Entropy of a string
# = minimum average number of bits per symbol
# required for encoding the string
#
# So the theoretical limit for data compression:
# Shannon Entropy of the string * string length
# FB - 201011291
import math
from sets import Set

st = 'acgtaggatcccctat' # input string
# st = '00010101011110' # Shannon entropy for 'aabcddddefffg' would be 1 bit/symbol

print 'Input string:'
print st
print
stList = list(st)
alphabet = list(Set(stList)) # list of symbols in the string
print 'Alphabet of symbols in the string:'
print alphabet
print
# calculate the frequency of each symbol in the string
freqList = []
for symbol in alphabet:
    ctr = 0
    for sym in stList:
        if sym == symbol:
            ctr += 1
    freqList.append(float(ctr) / len(stList))
print 'Frequencies of alphabet symbols:'
print freqList
print
# Shannon entropy
ent = 0.0
for freq in freqList:
    ent = ent + freq * math.log(freq, 2)
ent = -ent
print 'Shannon entropy:'
print ent
print 'Minimum number of bits required to encode each symbol:'
print int(math.ceil(ent))
ADD COMMENT
0
Entering edit mode

What do you mean adapt for NGS use? Are you wanting to calculate the entropy on a per site basis on the genome, based on the bases in reads that are aligned to that position?

ADD REPLY
0
Entering edit mode

the objective would be to do it on a defined region that could be (or not) aligned to a reference

ADD REPLY
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

Sorry, but how's this relevant to OP's question?

ADD REPLY

Login before adding your answer.

Traffic: 2152 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6