Question

Randomised subsetting of sequences in a fasta file using R

10.0 years ago

confusedious ▴ 490

I have a sequence alignment in fasta format with 219 sequences in it. I am testing a new phylogenetic method and I am curious about how subsets of differing sizes and compositions from my full alignment might impact upon selection of sites for inclusion in tree building and thus tree topology.

I am using 'ape' and 'phangorn' in R and have found that I can subset defined sequences using the following method:

library(phangorn)  # provides read.phyDat() and the subset() method for phyDat objects
testalign <- read.phyDat("alignment.fasta", format = "fasta", type = "DNA")
subset(testalign, subset = 1:10)

In this case I am creating a subset of sequences 1 through 10. Ideally I would like to extract subsets of a random size between 3 and 218 from this alignment and write each one out as its own alignment file. I would also prefer that the subsets not be taken in the order in which they appear in the original file (i.e. not 1:10, but 10 random sequences out of the 219).

Could anyone advise on how I might achieve this?

fasta alignment R • 3.9k views

Ram · Answer 1 · 2015-01-19

I don't have phangorn on the computer in front of me to test the whole thing, but you can get a random sample of integers with... sample() :)

sample(1:219, replace=TRUE, size=n)

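Using replace=TRUE is equivalent to a bootstrap sample of size n, so the same sequence can be drawn more than once. If you want each subset to contain distinct sequences, use replace=FALSE instead (this is the default for sample()).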
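
To make the effect of the replace argument concrete, here is a small base-R sketch. The subset size of 10 and the seed are illustrative choices only, not something from the question.

```r
set.seed(42)  # illustrative seed, only so the draws are reproducible
n <- 10       # illustrative subset size

# With replacement (bootstrap-style): the same index may occur more than once.
boot_idx <- sample(1:219, size = n, replace = TRUE)

# Without replacement (the default): n distinct indices in random order.
sub_idx <- sample(1:219, size = n)

anyDuplicated(sub_idx)  # 0, i.e. no index is repeated
```

Either index vector can be passed as the subset argument in subset(testalign, subset = ...), but only the second is guaranteed to give 10 different sequences.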

You could do the same to sample from a uniform distribution of sample sizes (Ns <- sample(3:218, replace=TRUE, size=100)), or use sapply and replicate to repeatedly sample at each of several values for n.
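
Putting the pieces together, a minimal and untested-against-real-data sketch might look like the following. The helper draw_subsets(), the number of subsets (100) and the output file names are my own assumptions for illustration; read.phyDat(), subset() and write.phyDat() are the phangorn calls. Indices within each subset are drawn without replacement so that no sequence appears twice in one file.

```r
# Hypothetical helper (not part of ape or phangorn): draw random index sets.
draw_subsets <- function(n_seqs, n_subsets, min_size = 3L, max_size = n_seqs - 1L) {
  stopifnot(min_size >= 1L, max_size <= n_seqs, min_size <= max_size)
  # Subset sizes: uniform on min_size..max_size; sizes may repeat across subsets.
  sizes <- min_size - 1L + sample.int(max_size - min_size + 1L, n_subsets, replace = TRUE)
  # Within a subset, sample.int() draws without replacement by default,
  # so each sequence index occurs at most once.
  lapply(sizes, function(k) sample.int(n_seqs, size = k))
}

set.seed(1)  # only to make the draws reproducible
idx_list <- draw_subsets(n_seqs = 219L, n_subsets = 100L)

# Apply to the alignment only if phangorn and the input file are available.
if (requireNamespace("phangorn", quietly = TRUE) && file.exists("alignment.fasta")) {
  testalign <- phangorn::read.phyDat("alignment.fasta", format = "fasta", type = "DNA")
  for (i in seq_along(idx_list)) {
    sub <- subset(testalign, subset = idx_list[[i]])
    # Assumed naming scheme: subset_001.fasta, subset_002.fasta, ...
    phangorn::write.phyDat(sub, file = sprintf("subset_%03d.fasta", i), format = "fasta")
  }
}
```

Keeping idx_list around (for example with saveRDS()) records which sequences went into each file, which may be handy when comparing the resulting trees.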