I have a FASTA file with sequences of about 100 amino acids, and I need to extend them on both sides with the corresponding amino acids, so that the resulting FASTA file contains the entire domain sequences instead of the 100-amino-acid stretches. I am trying to write a Biopython script that does the job, but I'm an absolute beginner and would be glad for any advice on how to do that. I figured that my script should first perform a BLAST search for all the sequences, take the top hit, and then somehow use it to extend the query sequence. However, I don't really know how to implement that (except for performing the BLAST search). Any help would be appreciated, thank you.
Could you define 'expand' and perhaps elaborate on the first sentence of your question?
Sorry for the confusion. What I need is a dataset of the full sequences of homologous domains (as many as possible) from a given protein class. There is a publication that uses a library of amino acid stretches (from homologous proteins) as a training set for its algorithm. However, these stretches cover only about 1/3 of the sequence I need. So I would need a script that goes through the entire list of sequences, finds for each stretch the full amino acid sequence of the corresponding protein, and then adds to both ends of the query sequence the amino acids that are missing from the domain I'm interested in. As you suggest, my plan is to first define a strategy, step by step, and then implement it in Python. My idea was to (1) perform a BLAST search for each sequence, (2) take the top hit, (3) create an alignment of the two sequences, and (4) add a defined number of amino acids on both sides according to the alignment. Does that make any sense? Any hints on what steps 3 and 4 could look like? I hope it's clearer now.
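Steps 3 and 4 could be sketched roughly as follows, assuming the full-length sequence of the top hit has already been retrieved as a plain string. This is only a minimal illustration using the Python standard library, not a tested pipeline: the function names `locate_fragment` and `extend_fragment`, the `min_anchor` threshold, and the toy sequences are all hypothetical. It also assumes the stretch matches the hit without insertions or deletions; if that does not hold, a real local alignment (for example with Biopython's `Bio.Align.PairwiseAligner`) would be needed to get the coordinates.

```python
from difflib import SequenceMatcher


def locate_fragment(fragment, full_seq, min_anchor=5):
    """Return (start, end) of the fragment within full_seq, or None.

    Assumption: the fragment is gap-free relative to full_seq, so a single
    offset describes the whole placement.
    """
    # Easy case: the stretch occurs verbatim in the full-length protein.
    start = full_seq.find(fragment)
    if start != -1:
        return start, start + len(fragment)

    # Fallback for a few point mismatches: anchor on the longest shared block
    # and derive the offset of the whole fragment from it.
    matcher = SequenceMatcher(None, fragment, full_seq, autojunk=False)
    match = matcher.find_longest_match(0, len(fragment), 0, len(full_seq))
    if match.size < min_anchor:
        return None  # too little in common to trust the placement
    start = max(0, match.b - match.a)
    end = min(len(full_seq), start + len(fragment))
    return start, end


def extend_fragment(fragment, full_seq, left=0, right=0):
    """Return the matched region plus `left`/`right` flanking residues."""
    location = locate_fragment(fragment, full_seq)
    if location is None:
        return None
    start, end = location
    # Clamp so the extension never runs past the ends of the protein.
    new_start = max(0, start - left)
    new_end = min(len(full_seq), end + right)
    return full_seq[new_start:new_end]


# Toy data (made up for illustration only).
full = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
stretch = "ISFVKSHF"
extended = extend_fragment(stretch, full, left=3, right=2)
```

Note that the returned sequence is sliced from the hit, not from the query, so any mismatched positions take the residues of the hit. The number of residues to add on each side would have to come from your knowledge of the domain boundaries.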