Question

Multiple sequence alignment statistics

1

Entering edit mode

5.5 years ago

Anand Rao ▴ 640

I have ~ 18000 files, each with clustal formatted protein alignments derived from Pfam-A.full. Some of these files are large > 500MB in size, the largest alignment is 3GB!

I need to calculate the following alignment statistics:

A. average UNaligned length in alignment

B. std. dev. of UNaligned length in alignment

C. average of pairwise sequence ID %

D. std. dev. of pairwise sequence ID %

Here are my 2 problems that I seek your help with:

1. I can calculate A and C (averages) using alistat that comes with UBUNTU, but not B or D (standard deviations).

2. For the really large alignments, due to excessive RAM requirements, I've had to use alistat's -f (fast) option, which estimates average %id by "sampling"

So far I've only used alistat for this work. But if there are other tools / tricks to report A - D, while having reasonable RAM and disk-usage footprints, and quick processing times, please let me know.

Thanks!

MSA statistics • 1.5k views

ADD COMMENT • link updated 5.4 years ago by Biostar 20 • written 5.5 years ago by Anand Rao ▴ 640

0

Entering edit mode

How much RAM do you have available?

Here are some possible options:

You may be able to overcome the memory issues of C and D by using generators (e.g. in Python) to iterate each of your sequences one at a time so that the whole file is not in memory at once. You can write all the pairwise %IDs to a file on the fly, so they do not need to all be stored in memory together either. A matrix of %ID values may not be too massive in memory anyway though.

Alternatively, split your alignments down in to individual files on disk (e.g. as an aligned fasta), then loop over these in a pairwise manner and create a separate result file.

Once you have all the pairwise %IDs, you can just import the data in to R or something to actually do the arithmetic.

ADD REPLY • link 5.5 years ago by Joe 21k

0

Entering edit mode

You can estimate all desired values by sampling (like with the -f option). You can implement this by selecting a number of aligned sequences at random from each file. You can also use bootstrapping if you need confidence intervals.

ADD REPLY • link 5.5 years ago by Jean-Karim Heriche 27k