Hi all,
I'm looking to compute the Shannon entropy score for a short sequence corresponding to a hyper-variable region; the idea is to compare this region across different samples. Any experience with that?
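To make it concrete, here is a minimal sketch of the kind of per-position comparison I mean (the sample sequences and the column_entropy helper are just illustrative; it assumes the region has already been extracted and aligned across samples):

import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# hypothetical aligned copies of the hyper-variable region, one per sample
region = {
    'sample1': 'ACGTAC',
    'sample2': 'ACGAAC',
    'sample3': 'ATGTAC',
}

for pos, col in enumerate(zip(*region.values()), start=1):
    print(pos, '%.3f' % column_entropy(col))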
There is an R package called entropy.
The 'seqtk comp' command returns the per-sequence base composition: column 2 is the sequence length and columns 3-6 are the #A, #C, #G, #T counts.
With the following FASTA file:
>seq1
AAAA
>seq2
ATCGACTTTTTTGTAGTACGTA
you can run this one-liner to get the Shannon entropy score for each sequence in your FASTA (H = -sum(p_b * log2(p_b)) over the four bases, where p_b is the base frequency; H is reset for each record so scores don't accumulate across sequences):
seqtk comp test.fa | awk '{H=0;for(i=3;i<=6;i++){if($i){H+=$i/$2*log($i/$2)/log(2)}}print $1,-H}'
which returns:
seq1 0
seq2 1.84199
Here is a C++ implementation. It hasn't been used or tested extensively, but feel free to use the code.
BBDuk calculates Shannon entropy, and can pass or fail sequences based on the score. For example:
bbduk.sh in=sequences.fa out=pass.fa outm=fail.fa entropy=0.9 entropywindow=50 entropyk=5
The code is in BBDukF.java in the function averageEntropy().
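Roughly, what those parameters mean: entropywindow is a sliding window along the read, entropyk is the k-mer size whose frequency distribution the entropy is computed from, and entropy is a 0-1 cutoff. A minimal sketch of that idea in Python (not BBDuk's actual code; the exact scaling in averageEntropy() may differ):

import math
import random
from collections import Counter

def window_entropy(seq, k=5):
    """Shannon entropy of k-mer frequencies in seq, scaled to [0, 1]."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    H = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # scale by the maximum achievable entropy (all k-mers distinct)
    return H / math.log2(total) if total > 1 else 0.0

def average_entropy(seq, window=50, k=5):
    """Mean scaled entropy over all sliding windows of the read."""
    starts = range(0, max(len(seq) - window, 0) + 1)
    return sum(window_entropy(seq[i:i + window], k) for i in starts) / len(starts)

random.seed(0)
read = ''.join(random.choice('ACGT') for _ in range(100))
print(average_entropy('A' * 100))  # homopolymer: ~0.0, fails an entropy=0.9 cutoff
print(average_entropy(read))       # random read: close to 1.0, passes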
Give BioJava a try:
import java.util.*;

import org.biojava.bio.dist.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.symbol.*;

public class Entropy {
    public static void main(String[] args) {
        Distribution dist = null;
        try {
            // create a biased distribution
            dist = DistributionFactory.DEFAULT.createDistribution(DNATools.getDNA());
            // set the weight of a to 0.97
            dist.setWeight(DNATools.a(), 0.97);
            // set the others to 0.01
            dist.setWeight(DNATools.c(), 0.01);
            dist.setWeight(DNATools.g(), 0.01);
            dist.setWeight(DNATools.t(), 0.01);
        } catch (Exception ex) {
            ex.printStackTrace();
            System.exit(-1);
        }

        // calculate the information content
        double info = DistributionTools.bitsOfInformation(dist);
        System.out.println("information = " + info + " bits");
        System.out.print("\n");

        // calculate the entropy (using the conventional log base of 2)
        HashMap entropy = DistributionTools.shannonEntropy(dist, 2.0);

        // print the entropy of each residue
        System.out.println("Symbol\tEntropy");
        for (Iterator i = entropy.keySet().iterator(); i.hasNext(); ) {
            Symbol sym = (Symbol) i.next();
            System.out.println(sym.getName() + "\t" + entropy.get(sym));
        }
    }
}
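As a quick cross-check of what those numbers mean, the same biased distribution can be worked out by hand (a sketch; it assumes bitsOfInformation() returns log2(alphabet size) minus the entropy, which matches the usual definition of information content):

import math

# the biased DNA distribution from the BioJava example above
weights = {'a': 0.97, 'c': 0.01, 'g': 0.01, 't': 0.01}

# Shannon entropy: H = -sum(p * log2(p))
H = -sum(p * math.log2(p) for p in weights.values())

# information content relative to a uniform 4-letter alphabet;
# assumption: this is what DistributionTools.bitsOfInformation computes
info = math.log2(len(weights)) - H

print('entropy     = %.4f bits' % H)    # ~0.2419
print('information = %.4f bits' % info) # ~1.7581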
I found this on the net. The next step would be to implement it for NGS use:
http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
# Shannon entropy of a string
# = minimum average number of bits per symbol
#   required for encoding the string
#
# So the theoretical limit for data compression:
#   Shannon entropy of the string * string length
# FB - 201011291
import math

st = 'acgtaggatcccctat'  # input string
# st = '00010101011110'  # Shannon entropy for this would be 1 bit/symbol

print('Input string:')
print(st)
print()

alphabet = list(set(st))  # list of distinct symbols in the string
print('Alphabet of symbols in the string:')
print(alphabet)
print()

# calculate the relative frequency of each symbol in the string
freqList = [st.count(symbol) / len(st) for symbol in alphabet]
print('Frequencies of alphabet symbols:')
print(freqList)
print()

# Shannon entropy: H = -sum(p * log2(p)) over all symbols
ent = -sum(freq * math.log(freq, 2) for freq in freqList)
print('Shannon entropy:')
print(ent)
print('Minimum number of bits required to encode each symbol:')
print(int(math.ceil(ent)))
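For the example string 'acgtaggatcccctat' this prints a Shannon entropy of about 1.977 bits/symbol, so 2 bits per symbol would be needed.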
Please see the publication http://bioinformatics.oxfordjournals.org/content/23/15/1875.full.pdf
See also: Calculating Shannon Entropy for DNA sequence? http://math.stackexchange.com/questions/1405130/calculating-shannon-entropy-for-dna-sequence