Hello,
I am trying to find a way to generate a library of sequences based on known diversity at specific residues. For instance, if position 1 is 50% A and 50% C, then 50% of the sequences in the library should have an A at position 1 and 50% should have a C. Ideally this would be Python-based, but I can't find any package that does this for entire DNA sequences given the percentage of each base at specific positions.
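For what it's worth, the idea in the question can be sketched with the Python standard library alone. This is only a minimal sketch under some assumptions: the `profile` table below is made up for illustration, the function names are hypothetical, and every position is drawn independently of the others, so the percentages are matched on average rather than exactly and duplicate sequences can occur.

```python
import random

# Hypothetical per-position frequencies: one dict per position,
# mapping base -> probability (each dict is assumed to sum to 1).
profile = [
    {"A": 0.5, "C": 0.5},                          # position 1
    {"G": 1.0},                                    # position 2 (fully conserved)
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # position 3
]

def sample_sequence(profile, rng):
    """Draw one sequence, choosing each position independently."""
    return "".join(
        rng.choices(list(col.keys()), weights=list(col.values()))[0]
        for col in profile
    )

def sample_library(profile, n, seed=None):
    """Draw n sequences; duplicates are possible."""
    rng = random.Random(seed)
    return [sample_sequence(profile, rng) for _ in range(n)]

library = sample_library(profile, 10_000, seed=0)

# Check how closely the sampled library follows the requested 50% A.
frac_a = sum(seq[0] == "A" for seq in library) / len(library)
print(f"fraction with A at position 1: {frac_a:.3f}")
```

With enough sequences the observed fraction at each position should settle close to the requested percentage.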
Thanks
I'm fairly new to the world of bioinformatics/coding, so I might not be understanding the functions completely, but to me this looks like it would align all of my sequences and then let me filter for diversity by position, which is very helpful for what I'm doing. I'm also trying to work out the next step, which is to generate a library of 10^8-10^10 unique sequences based on a specific diversity I already have, and I was wondering if there is a simple way to do this. I would like the library to reproduce the diversity at every residue: if my aligned sequences have an A at position 1 25% of the time and a T at position 300 10% of the time, the library should show the same percentages. I would like to do this at almost every position in the DNA sequence, if not every one.
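As a rough illustration of the first half of that step, the per-position percentages can be counted directly from sequences that are already aligned. This is a sketch, not a finished tool: the toy `aligned` list and the function name `column_frequencies` are made up for the example, the rows are assumed to be equal-length strings from whatever alignment reader you use, and gap characters are simply left out of the counts.

```python
from collections import Counter

# Hypothetical toy alignment; real input would be the rows of your aligned file.
aligned = ["ACGT", "ACGA", "CCGT", "AC-T"]

def column_frequencies(aligned, skip="-"):
    """Return one {base: fraction} dict per alignment column, ignoring gaps."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for i in range(length):
        # Count the bases in column i, leaving out gap characters.
        counts = Counter(s[i] for s in aligned if s[i] not in skip)
        total = sum(counts.values())
        freqs.append({b: c / total for b, c in counts.items()} if total else {})
    return freqs

profile = column_frequencies(aligned)
for pos, col in enumerate(profile, start=1):
    print(pos, col)
```

Two caveats on the library size. Treating positions independently discards any covariation between positions, and the number of distinct sequences that can come out is capped by the product of the number of bases observed at each position. At 10^8-10^10 sequences it is probably also better to write sequences to a file as they are generated than to hold them all in memory.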
No, I think there's a misunderstanding. AlignIO only lets you load a multiple sequence alignment; it does not align anything itself. You first have to align the sequences with clustalo or similar software.
However, I reread your original question and I think I got it wrong. You already know the conservation at each position, and you would like to create a dataset of sequences that follows it. In theory, I think you should create a profile (or pick one from a profile database) and use it for an HMMER search. This is the only thing I can think of that comes close to a solution to your problem.
You can have a look at the HMMER website while you wait for someone more experienced than me to answer.