Question

Library generation based on known diversity of specific residues.

15 months ago

DrPsych • 0

Hello,

I am trying to find a way to generate a library of sequences based on known diversity at specific residues. For instance, if position 1 is 50% A and 50% C, then 50% of the sequences in the library should have an A at position 1 and 50% should have a C. Ideally this would be Python-based, but I can't find any package that does this for entire DNA sequences given the percentage of each base at specific positions.
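To make the request concrete, here is a minimal sketch of the idea using only the Python standard library. The `position_probs` table and the function names are hypothetical, made up for illustration, and the sketch assumes each position varies independently of the others.

```python
import random

# Hypothetical per-position diversity: one dict per position, base -> fraction.
# An invariant position simply lists a single base with weight 1.0.
position_probs = [
    {"A": 0.5, "C": 0.5},    # position 1: 50% A, 50% C
    {"G": 1.0},              # position 2: always G
    {"A": 0.25, "T": 0.75},  # position 3: 25% A, 75% T
]


def sample_sequence(probs, rng):
    """Draw one sequence, sampling every position independently."""
    return "".join(
        rng.choices(list(p.keys()), weights=list(p.values()), k=1)[0]
        for p in probs
    )


def sample_library(probs, n, seed=0):
    """Draw n sequences (duplicates are possible) with a reproducible seed."""
    rng = random.Random(seed)
    return [sample_sequence(probs, rng) for _ in range(n)]


library = sample_library(position_probs, 10_000)
frac_a = sum(seq[0] == "A" for seq in library) / len(library)
print(f"fraction of A at position 1: {frac_a:.3f}")
```

With enough sequences the observed fraction at each position approaches the requested percentage. Sampling positions independently does not reproduce any correlation between positions.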

Thanks

library python generation • 763 views

ADD COMMENT • link updated 15 months ago by biomarco ▴ 50 • written 15 months ago by DrPsych • 0

score 0 · Answer 1 · 2023-08-19

15 months ago

biomarco ▴ 50

If you don't have a very large number of sequences, one solution is to parse your alignment with the Biopython module AlignIO, put all the sequences into a pandas DataFrame, and use the pandas groupby function to filter them by the letters at your positions of interest. If there are too many sequences to hold in memory together as a DataFrame, you can still process them row by row.
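A rough sketch of that workflow might look like the following. The helper names, the `pos1`, `pos2`, ... column labels, the file name in the comment, and the toy sequences are all assumptions made for illustration; `AlignIO.read` expects a file that already contains an alignment.

```python
import pandas as pd


def load_alignment(path, fmt="fasta"):
    """Read an existing multiple sequence alignment into {id: sequence}."""
    # Requires Biopython; imported here so the rest runs without it installed.
    from Bio import AlignIO

    alignment = AlignIO.read(path, fmt)
    return {record.id: str(record.seq) for record in alignment}


def alignment_to_frame(seqs):
    """One row per sequence, one column per alignment position (pos1, pos2, ...)."""
    df = pd.DataFrame([list(s) for s in seqs.values()], index=list(seqs.keys()))
    df.columns = [f"pos{i}" for i in range(1, df.shape[1] + 1)]
    return df


# Toy aligned sequences standing in for load_alignment("my_alignment.fasta")
toy = {"seq1": "ACGT", "seq2": "CCGT", "seq3": "ACGA", "seq4": "CCGT"}
df = alignment_to_frame(toy)

# Count the letters seen at position 1, then keep only rows with an A there.
counts_pos1 = df.groupby("pos1").size()
with_a = df[df["pos1"] == "A"]
print(counts_pos1)
print(with_a)
```

Dividing `counts_pos1` by `len(df)` gives the observed fraction of each letter at that position.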

ADD COMMENT • link 15 months ago by biomarco ▴ 50

I'm fairly new to bioinformatics and coding, so I might not be understanding the functions completely, but it seems this would align all of my sequences and then let me filter for diversity by position, which is very helpful for what I'm doing. I'm also trying to figure out the next step, which is to generate a library of 10^8-10^10 unique sequences based on a specific diversity I already have, and I was wondering whether there is a simple way to do this. I would like the library to represent the diversity at every residue: if my aligned sequences have an A at position 1 25% of the time and a T at position 300 10% of the time, the library should show the same percentages. I would like to do this at almost every position in the DNA sequence, if not all of them.
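One possible way to approach this step, sketched with NumPy under some loud assumptions: the input sequences are already aligned and of equal length, only A/C/G/T are counted (gaps and ambiguity codes are ignored), and every position is sampled independently. The function names and the toy alignment are hypothetical.

```python
import numpy as np

ALPHABET = np.array(list("ACGT"))


def frequency_matrix(aligned_seqs):
    """Return an (L, 4) array of per-position A/C/G/T frequencies."""
    arr = np.array([list(s.upper()) for s in aligned_seqs])  # shape (N, L)
    counts = np.stack(
        [(arr == base).sum(axis=0) for base in ALPHABET], axis=1
    ).astype(float)
    totals = counts.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("an alignment column contains no A/C/G/T")
    return counts / totals


def sample_batch(freqs, n, rng):
    """Draw n sequences by inverse-CDF sampling at every position."""
    cdf = np.cumsum(freqs, axis=1)
    cdf[:, -1] = 1.0  # guard against floating-point rounding
    u = rng.random((n, freqs.shape[0]))
    idx = (u[:, :, None] >= cdf[None, :, :]).sum(axis=2)  # base index 0..3
    return ["".join(row) for row in ALPHABET[idx]]


def unique_library(freqs, target, batch_size=10_000, seed=0, max_batches=1_000):
    """Collect up to `target` distinct sequences, in the order first drawn."""
    rng = np.random.default_rng(seed)
    seen = {}
    for _ in range(max_batches):
        for seq in sample_batch(freqs, batch_size, rng):
            if seq not in seen:
                seen[seq] = None
                if len(seen) == target:
                    return list(seen)
    return list(seen)  # fewer than target if the sequence space is too small


aligned = ["ACGT", "ACGA", "CCGT", "ACTT"]  # toy alignment
freqs = frequency_matrix(aligned)
print(unique_library(freqs, target=5, batch_size=100))
```

Two caveats apply. Removing duplicates shifts the per-position frequencies away from the input when the number of possible sequences is small relative to the target, because common sequences can only appear once. Also, an in-memory set will not hold 10^8-10^10 sequences, so at that scale the batches would likely need to be written to disk and deduplicated there.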

ADD REPLY • link 15 months ago by DrPsych • 0

No, I think there's some misunderstanding. AlignIO only lets you load a multiple sequence alignment; you first have to align the sequences yourself using clustalo or similar software.

However, I read your original question again and I think I got it wrong. Basically, you already know the conservation at each position and you would like to create a dataset of sequences that matches it. I think that in theory you should create a profile (or pick one from a profile database) and use it for a HMMER search. This is the only thing I can think of that comes close to a solution to your problem.

You can have a look at the HMMER website while you wait for someone more experienced than me to answer.

ADD REPLY • link 15 months ago by biomarco ▴ 50