How to split all fasta sequences in a multifasta file into ten equal parts
4.1 years ago

Hi.

I have a multifasta file containing human proteins of interest (N > 5000). The proteins differ in length.

I want to split every sequence in that file into TEN equal parts. For example, if a protein is 200 amino acids long, each non-overlapping segment should be 20 amino acids long. This should apply to every sequence in the main file, whatever its length.

Also, the output should be TEN multifasta files, one for each of the TEN SEGMENTS. For example, for a 100 amino acid protein, Multifasta1 should contain the first segment (amino acids 1-10), Multifasta2 should contain the second segment (amino acids 11-20), and so on.
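The segment coordinates described above can be worked out with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original question; the function name `segment_bounds` is hypothetical, and the choice to give the extra residues to the first segments when the length is not divisible by ten is an assumption.

```python
def segment_bounds(length, n_parts=10):
    """Return 1-based, inclusive (start, end) pairs for n_parts consecutive segments.

    Segment sizes differ by at most one residue. Assumption: the first
    `length % n_parts` segments receive the extra residue. An empty segment
    (sequence shorter than n_parts) is reported as (start, start - 1).
    """
    base, extra = divmod(length, n_parts)
    bounds = []
    start = 1
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds


# First three segments of a 100 amino acid protein
print(segment_bounds(100)[:3])  # [(1, 10), (11, 20), (21, 30)]
```

For a 200 amino acid protein the same function gives segments of 20 residues each, starting with (1, 20) and (21, 40).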

Each multifasta file should contain the corresponding segment of every protein, appended one after another. I am not an expert coder, so any help will be greatly appreciated.

Thanks in advance.

fasta multifasta split equal parts splitfasta • 2.1k views

Thank you for the data description. Please post example input, the expected output, and what you have tried so far. Try the seqkit split function.


You will need to first split the original file into one file per protein, THEN split each of those individual files into 10 pieces. The faSplit utility from Jim Kent, which is linked in @Juke34's tutorial, would be one option for both steps.

4.1 years ago
Alex Nesmelov ▴ 200

Here is a solution in R using the tidyverse and seqinr packages. Don't forget to replace fasta_path with your file name. When a sequence can't be split into 10 equal parts, the pieces cannot all have the same length: cut() divides the positions into 10 equal-width intervals, so some pieces come out one residue shorter than others.

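library(tidyverse)  # purrr::map() and stringr::str_c()
library(seqinr)     # read.fasta() and write.fasta()

# Number of pieces per sequence (also the number of output files)
n_chunks <- 10

# Path to the input multifasta file
fasta_path <- "your_file.fa"

# Read the protein sequences as a named list
fasta_data <- read.fasta(fasta_path, seqtype = "AA")

# Split one sequence into n_chunks consecutive pieces
split_sequence <- function(sequence) {
  cut_vector <- cut(seq_along(sequence), breaks = n_chunks)
  split(sequence, cut_vector)
}

fasta_split <- map(fasta_data, split_sequence)

# Write one multifasta file per piece index
for (current_chunk in 1:n_chunks) {

  # Take piece number current_chunk from every protein
  current_multifasta <- map(fasta_split, ~ .[[current_chunk]])

  current_multifasta_name <- str_c("Multifasta_part_", current_chunk, ".fa")

  write.fasta(current_multifasta,
              names = names(current_multifasta),
              file.out = current_multifasta_name)

}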
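For readers who would rather avoid R, the same idea can be sketched in plain Python with no third-party packages. This is an illustrative alternative, not the answer's method: the function names (`read_fasta`, `split_equal`, `split_multifasta`) are hypothetical, the output names simply mirror the `Multifasta_part_N.fa` pattern used above, and the remainder is assigned to the first pieces, which is an assumption that differs slightly from how cut() distributes it.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_equal(sequence, n_parts=10):
    """Split a string into n_parts consecutive pieces whose lengths differ by at most one."""
    base, extra = divmod(len(sequence), n_parts)
    pieces, start = [], 0
    for i in range(n_parts):
        # Assumption: the first `extra` pieces get one additional residue
        end = start + base + (1 if i < extra else 0)
        pieces.append(sequence[start:end])
        start = end
    return pieces


def split_multifasta(fasta_path, out_dir=".", n_parts=10, prefix="Multifasta_part_"):
    """Write n_parts multifasta files; file k holds piece k of every input sequence."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_paths = [out_dir / f"{prefix}{k}.fa" for k in range(1, n_parts + 1)]
    handles = [open(p, "w") for p in out_paths]
    try:
        for header, sequence in read_fasta(fasta_path):
            for handle, piece in zip(handles, split_equal(sequence, n_parts)):
                handle.write(f">{header}\n{piece}\n")
    finally:
        for handle in handles:
            handle.close()
    return out_paths
```

Calling `split_multifasta("your_file.fa")` would write the ten files into the current directory. Proteins shorter than ten residues produce empty pieces, which this sketch writes as a header followed by an empty line; filter those out if downstream tools reject empty records.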

Thanks a lot. This is really helpful. Exactly what I needed.

4.1 years ago
Juke34 8.9k

Many solutions are available; have a look here: Tutorial: FASTA file split
