Hi.
I have a multifasta file containing human proteins of interest (N > 5000). The proteins are mostly of different lengths.
I want to split every sequence in that file into TEN equal parts. E.g. if a protein is 200 amino acids long, each non-overlapping segment should be 20 amino acids long. This should apply to all sequences in the main file, whatever their length.
Also, the output should be TEN multifasta files, one for each of the TEN SEGMENTS. For example, for a 100 amino acid protein, Multifasta1 should contain the first segment (amino acids 1-10), Multifasta2 should contain the second (amino acids 11-20), and so on.
In each multifasta file, the corresponding segment of every protein should be appended. I am not an expert coder, so any help will be greatly appreciated.
Thanks in advance.
Thank you for the data description. Please post example input, expected output, and what you have tried so far. You could also try the seqkit split function.
You will first need to split the original file into one file per protein, THEN split each of those files into 10 pieces.
The faSplit utility from Jim Kent that is linked in @Juke34's tutorial would be one option to do both.
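If you would rather not go through intermediate per-protein files, here is a minimal Python sketch of the splitting described in the question, using only the standard library. The function names, the output prefix and the `segment=i/10` header suffix are my own choices for illustration, not part of any tool mentioned above. It also makes two assumptions the question does not settle: when a length is not divisible by 10, segment boundaries are placed at `i * length // 10`, so segment lengths differ by at most one residue; and segments that come out empty (proteins shorter than 10 residues) are skipped.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_equal(seq, n=10):
    """Cut seq into n consecutive, non-overlapping pieces.

    Assumption: boundaries sit at i * len(seq) // n, so piece lengths
    differ by at most one when len(seq) is not a multiple of n.
    """
    length = len(seq)
    return [seq[i * length // n:(i + 1) * length // n] for i in range(n)]


def split_multifasta(in_path, out_prefix, n=10):
    """Write n multifasta files; file i holds segment i of every protein."""
    paths = [f"{out_prefix}{i + 1}.fasta" for i in range(n)]
    outs = [open(p, "w") for p in paths]
    try:
        for header, seq in read_fasta(in_path):
            for i, (out, seg) in enumerate(zip(outs, split_equal(seq, n)), start=1):
                if not seg:
                    # Assumption: skip empty segments of very short proteins.
                    continue
                out.write(f">{header} segment={i}/{n}\n{seg}\n")
    finally:
        for out in outs:
            out.close()
    return paths
```

Calling `split_multifasta("proteins.fasta", "Multifasta")` (the input filename is a placeholder) would write `Multifasta1.fasta` to `Multifasta10.fasta` in the current directory. Please check a few proteins by hand before trusting the output on the full set.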