Randomised subsetting of sequences in a fasta file using R
9.9 years ago
confusedious ▴ 490

I have a sequence alignment in fasta format with 219 sequences in it. I am testing a new phylogenetic method and I am curious about how subsets of differing sizes and compositions from my full alignment might impact upon selection of sites for inclusion in tree building and thus tree topology.

I am using 'ape' and 'phangorn' in R and have found that I can subset defined sequences using the following method:

library(phangorn)
testalign <- read.phyDat("alignment.fasta", format = "fasta", type = "DNA")
# keep only sequences 1 through 10
subset(testalign, subset = 1:10)

In this case I am creating a subset of sequences 1 through 10. Ideally I would like to extract subsets of this alignment, each of a random size between 3 and 218, and then write these subsets out as individual alignment files. I would prefer, of course, that the sequences in each subset not be taken in the order in which they appear in the original file (i.e. not 1:10, but 10 sequences chosen at random from the alignment of 219).

Could anyone advise on how I might achieve this?

fasta alignment R • 3.9k views
9.9 years ago
David W 4.9k

I don't have phangorn on the computer in front of me to test the whole thing, but you can get a random sample of integers with... sample() :)

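sample(1:219, size=n, replace=FALSE)

With replace=FALSE (the default) each sequence is drawn at most once, which is what you want for a subset. Using replace=TRUE instead would allow the same sequence to be drawn more than once, as in a bootstrap sample of size n.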

You could do the same to draw the subset sizes themselves uniformly at random (Ns <- sample(3:218, replace=TRUE, size=100)), or use sapply and replicate to sample repeatedly at each of several values of n.
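Putting those pieces together, a rough and only partly tested sketch might look like the following. The number of subsets (100), the seed, and the output file names (subset_001.fasta and so on) are made up for illustration. The 219 comes from the question and is assumed to match the number of sequences in the file. The writing step assumes phangorn's write.phyDat with format = "fasta" and is skipped if phangorn or alignment.fasta is not available.

```r
# Sketch: draw random subsets of sequence indices, then (optionally) write
# each subset out as its own FASTA file with phangorn.
set.seed(42)                      # arbitrary seed, only for reproducibility

n_seqs    <- 219                  # sequences in the full alignment (from the question)
n_subsets <- 100                  # hypothetical number of subsets to draw

# One random subset size per replicate, each between 3 and 218.
# replace = TRUE here only means two replicates may share the same size.
sizes <- sample(3:218, size = n_subsets, replace = TRUE)

# For each size, pick that many distinct sequence indices
# (sample() draws without replacement by default).
index_sets <- lapply(sizes, function(n) sample(seq_len(n_seqs), size = n))

# Writing step: runs only if phangorn and the input file are both present.
if (requireNamespace("phangorn", quietly = TRUE) &&
    file.exists("alignment.fasta")) {
  testalign <- phangorn::read.phyDat("alignment.fasta",
                                     format = "fasta", type = "DNA")
  for (i in seq_along(index_sets)) {
    sub <- subset(testalign, subset = index_sets[[i]])
    # Hypothetical output names: subset_001.fasta, subset_002.fasta, ...
    phangorn::write.phyDat(sub,
                           file = sprintf("subset_%03d.fasta", i),
                           format = "fasta")
  }
}
```

Because the indices come out of sample() in random order, the sequences in each subset are also not in file order. If you would rather keep file order within a subset, wrap the inner call in sort().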
