I have a large multiple-sequence alignment (~70,000 columns) from which I would like to extract a small number (~1,000) of columns. Specifically, I want the columns that correspond to particular positions in the ungapped sequence of one of the records in the MSA.
One way to do this (I think) would be to cycle over the columns and keep a tally of how many non-gap base pairs have been parsed for the record in question, or even create a 1:1 mapping (storing which column contains the n-th non-gap base pair). But is there a faster way?
I am most familiar with biopython, so a solution in that framework would be easiest.
Thank you!
You could use GBlocks to remove all the gapped columns...but then that would throw off your position numbers