Find A (Gapped) Sequence Within A Multiple Sequence Alignment

0

Entering edit mode

11.5 years ago

sorrywm ▴ 10

I have a large multiple-sequence alignment (~70,000 columns) from which I would like to extract a small number (~1,000) of columns. Specifically, I want the columns that correspond to particular positions in the ungapped sequence of one of the records in the MSA.

One way to do this (I think) would be to cycle over the columns and keep a tally of how many non-gap base pairs have been parsed for the record in question, or even create a 1:1 mapping (storing which column contains the n-th non-gap base pair). But is there a faster way?

I am most familiar with biopython, so a solution in that framework would be easiest.

Thank you!

msa alignment biopython • 2.4k views

ADD COMMENT • link 11.5 years ago by sorrywm ▴ 10

0

Entering edit mode

You could use GBlocks to remove all the gapped columns...but then that would throw off your position numbers

ADD REPLY • link 11.5 years ago by anin.gregory ▴ 110

Login before adding your answer.