Pegas import fasta and calculate theta
1
0
Entering edit mode
9.9 years ago
Adrian Pelin ★ 2.6k

Hello,

I have an alignment between 3 species at the codon level and would like to calculate theta. The "ape" part is easy:

a <- read.FASTA("M715_870002501/aligned_nt.fasta"

> class(a)
[1] "DNAbin"
> methods(class = "haplotype")
[1] [.haplotype*     plot.haplotype*  print.haplotype* sort.haplotype*

   Non-visible functions are asterisked
> class(a)
[1] "DNAbin"
> s <- length(seg.sites(a))
> s
[1] 547

However, I am not sure how to import this DNAbin format into pegas for the theta.s() function. The data function won't work:

> data(a)
Warning message:
In data(a) : data set \u2018a\u2019 not found

According to pegas docs, for theta.s:

Description
This function computes the population parameter THETA using the number of segregating sites
s - in a sample of
n - DNA sequences.
Usage
theta.s(s, n, variance = FALSE)
Arguments
s - a numeric giving the number of segregating sites.
n - a numeric giving the number of sequences.
variance - a logical indicating whether the variance of the estimated THETA should be
returned (TRUE), the default being FALSE.

Any ideas? I feel like solution is simple and I am missing it... do I just do theta.s(s, 3) ? Since there are 3 DNA sequences?

Adrian

pegas theta fasta R • 5.2k views
ADD COMMENT
1
Entering edit mode
9.9 years ago
David W 4.9k

Do I just do theta.s(s, 3) ? Since there are 3 DNA sequences

Yup. From ?theta.s

s    a numeric giving the number of segregating sites.
n    a numeric giving the number of sequences.
variance    a logical indicating whether the variance of the estimated THETA should be returned (TRUE), the default being FALSE.

You can make it a function on a dna bin:

theta_watterson <- function(d) length(seg.sites(d)) / sum(1/1:(nrow(d)-1))

Note, both approaches will give you Ne mutation rate per gene*, not per nucleotide.

ADD COMMENT
0
Entering edit mode

Thanks a lot for the help! By the way, how do I get it per nucleotide, that was my next question. I need to compare 2 estimators and one of them spits out theta per nucleotide so it would make sense to normalize this one relative to length as well. This DNAbin format is driving me crazy :) I know it is a matter of just dividing by length of the alignment, but I cannot get the length into a variable. Also, I need to divide the variance by length as well, correct?

ADD REPLY
0
Entering edit mode

Just the normal -- , ncol (or dim). You should check how the other estimator deals with missing data (if you have any) if you wan to make a fair comparison.

ADD REPLY
0
Entering edit mode

I am looking at complete ORFs so there should not be any missing data. Oddly, the only way to get alignment length is length(nchar(as.character(a$ECU01_1130))), which is pretty messy and inconvenient since one needs to supply at least one of the sequence names.

> a
3 DNA sequences in binary format stored in a list.

All sequences of same length: 1554

Labels: orf00001_C112950_F_131_1612 NBO_26g0012 ECU01_1130

Base composition:
    a     c     g     t
0.337 0.172 0.216 0.275
> length(nchar(as.character(a$ECU01_1130)))
[1] 1554
> ncol(a)
NULL
> dim(a)
NULL
ADD REPLY
0
Entering edit mode

You are using read.FASTA to get the files in, which returns a list of characters. If you use read.dna( ... , format='"fasta") you should get a matrix that nrow/ncol/dim and the rest should work on.

ADD REPLY
0
Entering edit mode

In the ape manual, read.FASTA is given as an example in the read.dna function, which made me think read.FASTA is a shortcut for format="fasta". Can't believe this was the problem. Thank you so much for your help!

ADD REPLY

Login before adding your answer.

Traffic: 1904 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6