Entering edit mode
8.1 years ago
Macherki M E
▴
120
I used coding DNA sequence (CDs) to predict the average of the protein size of E.coli k12 from a sample provided by the seqinr
R package ,data ec999
. The size of each sequence is stored in variable x
.
I used two histograms to fit the distribution of this variable and their 10 base logarithm.
I suggest that the log linear distribution is more reasonable.The average E can be calculated then:
E1=mean(x)/3
E2=exp(mean(log(x)))/3
- I have to classify CDS data from different genome, shall I use E2 and theirs standard deviation as parameters?
reasonable for what?
Classification with respect to what kind of annotation or classes? If the CDS is correct then then length(AA) == length(CDS)/3. You can also simply use the know protein lengths from the genome annotation.