I have a FASTA file which contains thousands of sequences, with headers like this:
>scaffold_1|c(135298..135582)|DNA|DNA-0-1_NV
Each pipe-delimited section of the header can vary from sequence to sequence, and some sequences might have identical headers except for the first or second sections.
I need to be able to search through this large file, pick out specific sequences based upon their header, and print them to another file. The search needs to allow some degeneracy, however. I have seen examples where a library text file was used, but only exact matches between the FASTA file and the library file would work.
For instance, let's say I want all sequences which have any variation on 'piggyBac' in their header (so PiggyBac, piggybac, DNA-piggyBac, etc.).
I'm just at a loss as to how to do this exactly. Is there some way to index this file and then search the keys for variations on 'piggyBac'? If anyone has suggestions or can point me to code that does something similar it would really be helpful.
I appreciate it
I should mention I'm using Python 3 and the latest release of Biopython.
It would be better to say which version of Python 3 (e.g. 3.3? 3.4?) and which version of Biopython (e.g. 1.65?), partly because people may be reading this question in a year's time, but also in case it helps give you a more accurate answer.
Edited to show versions for both
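One possible approach, as a minimal sketch rather than a definitive answer: scan the headers with a case-insensitive regular expression and keep only the records that match. The sketch below uses only the standard library so it runs without Biopython. The function names `iter_fasta` and `filter_fasta`, the pattern (which also tolerates an optional separator, as in 'piggy-bac'), and every header in the example after the first are assumptions made up for illustration.

```python
import re

# Case-insensitive pattern. Allowing an optional "-", "_" or space between
# "piggy" and "bac" is an assumption; adjust it to the variants you expect.
PATTERN = re.compile(r"piggy[-_ ]?bac", re.IGNORECASE)


def iter_fasta(lines):
    """Yield (header, sequence_lines) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def filter_fasta(lines, pattern=PATTERN):
    """Return FASTA text for the records whose header matches the pattern."""
    out = []
    for header, seq in iter_fasta(lines):
        if pattern.search(header):
            out.append(">" + header)
            out.extend(seq)
    return "\n".join(out) + ("\n" if out else "")


# Hypothetical records: only the first header comes from the question.
example = """>scaffold_1|c(135298..135582)|DNA|DNA-0-1_NV
ACGTACGT
>scaffold_2|(10..200)|DNA|DNA-piggyBac
TTAA
GGCC
>scaffold_3|(5..50)|PiggyBac|x
CCCC
>scaffold_4|(1..9)|LTR|Gypsy
AAAA
""".splitlines()

# Prints only the scaffold_2 and scaffold_3 records.
print(filter_fasta(example), end="")
```

For a real file, an open file handle can be passed in place of `example`, and the returned text written to the output file. With Biopython, the same `PATTERN.search` test can be applied to `record.description` for each record from `SeqIO.parse(handle, "fasta")`, and the matching records written out with `SeqIO.write`.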