I have ~ 18000 files, each with clustal formatted protein alignments derived from Pfam-A.full. Some of these files are large > 500MB in size, the largest alignment is 3GB!
I need to calculate the following alignment statistics:
A. average UNaligned length in alignment
B. std. dev. of UNaligned length in alignment
C. average of pairwise sequence ID %
D. std. dev. of pairwise sequence ID %
Here are my 2 problems that I seek your help with:
1. I can calculate A and C (averages) using alistat that comes with UBUNTU, but not B or D (standard deviations).
2. For the really large alignments, due to excessive RAM requirements, I've had to use alistat's -f (fast) option, which estimates average %id by "sampling"
So far I've only used alistat for this work. But if there are other tools / tricks to report A - D, while having reasonable RAM and disk-usage footprints, and quick processing times, please let me know.
Thanks!
How much RAM do you have available?
Here are some possible options:
You may be able to overcome the memory issues of C and D by using generators (e.g. in Python) to iterate each of your sequences one at a time so that the whole file is not in memory at once. You can write all the pairwise %IDs to a file on the fly, so they do not need to all be stored in memory together either. A matrix of %ID values may not be too massive in memory anyway though.
Alternatively, split your alignments down in to individual files on disk (e.g. as an aligned fasta), then loop over these in a pairwise manner and create a separate result file.
Once you have all the pairwise %IDs, you can just import the data in to R or something to actually do the arithmetic.
You can estimate all desired values by sampling (like with the -f option). You can implement this by selecting a number of aligned sequences at random from each file. You can also use bootstrapping if you need confidence intervals.