Hello,
I have an alignment between 3 species at the codon level and would like to calculate theta. The "ape" part is easy:
a <- read.FASTA("M715_870002501/aligned_nt.fasta"
> class(a)
[1] "DNAbin"
> methods(class = "haplotype")
[1] [.haplotype* plot.haplotype* print.haplotype* sort.haplotype*
Non-visible functions are asterisked
> class(a)
[1] "DNAbin"
> s <- length(seg.sites(a))
> s
[1] 547
However, I am not sure how to import this DNAbin format into pegas for the theta.s()
function. The data function won't work:
> data(a)
Warning message:
In data(a) : data set \u2018a\u2019 not found
According to pegas docs, for theta.s:
Description
This function computes the population parameter THETA using the number of segregating sites
s - in a sample of
n - DNA sequences.
Usage
theta.s(s, n, variance = FALSE)
Arguments
s - a numeric giving the number of segregating sites.
n - a numeric giving the number of sequences.
variance - a logical indicating whether the variance of the estimated THETA should be
returned (TRUE), the default being FALSE.
Any ideas? I feel like solution is simple and I am missing it... do I just do theta.s(s, 3) ? Since there are 3 DNA sequences?
Adrian
Thanks a lot for the help! By the way, how do I get it per nucleotide, that was my next question. I need to compare 2 estimators and one of them spits out theta per nucleotide so it would make sense to normalize this one relative to length as well. This DNAbin format is driving me crazy :) I know it is a matter of just dividing by length of the alignment, but I cannot get the length into a variable. Also, I need to divide the variance by length as well, correct?
Just the normal -- , ncol (or dim). You should check how the other estimator deals with missing data (if you have any) if you wan to make a fair comparison.
I am looking at complete ORFs so there should not be any missing data. Oddly, the only way to get alignment length is
length(nchar(as.character(a$ECU01_1130)))
, which is pretty messy and inconvenient since one needs to supply at least one of the sequence names.You are using
read.FASTA
to get the files in, which returns a list of characters. If you useread.dna( ... , format='"fasta")
you should get a matrix thatnrow
/ncol
/dim
and the rest should work on.In the ape manual,
read.FASTA
is given as an example in theread.dna
function, which made me thinkread.FASTA
is a shortcut forformat="fasta"
. Can't believe this was the problem. Thank you so much for your help!