How to split a long protein sequence in small sequences with same header?
2
0
Entering edit mode
2.4 years ago
FelipeMSD • 0

Dear all,

I have a protein catalogue (fasta file) with more than 500 sequences that are longer than 10k aa and because of the limitation of a tool I am using I need to split these long sequences in small sequences with up to 10k aa that should have the same header as their original versions.

So, for example, imagine that I have a FASTA file with 3 sequences with 12aa each:

>Header_1
AAAAAABBBBBB
>Header_2
CCCCCCDDDDDD
>Header_3
EEEEEEFFFFFF

What could I do to have a file with them split in 2 sequences with 6aa each, keeping the same header (as below)?

>Header_1
AAAAAA
>Header_1
BBBBBB
>Header_2
CCCCCC
>Header_2
DDDDDD
>Header_3
EEEEEE
>Header_3
FFFFFF

Is there any tool or bash command I could use? Thanks in advance ;)

fasta • 2.7k views
ADD COMMENT
1
Entering edit mode

just a word of caution: it's not good practise to create fasta files with the exact same header for different proteins (== avoid duplicate headers in a fasta file) . Many tools that parse in fasta files will map them on the header so you will overwrite previous sequences when there are duplicated headers

ADD REPLY
0
Entering edit mode

That's a very good tip! Thanks! In that case, my headers are NCBI taxonomic IDs and I am using Kaiju to align my reads with that protein catalogue. So, as I just want to assign taxonomy to the reads through this alignment, I think there will be no problems.

ADD REPLY
1
Entering edit mode

if fasta file is as simple as that, you can try following simple code:

$ bioawk -c fastx  '{printf(">%s\n%s\n>%s\n%s\n",$name,substr($seq,1,length($seq)/2),$name,substr($seq,length($seq)/2+1,length($seq)))}' test.fa

>Header_1
AAAAAA
>Header_1
BBBBBB
>Header_2
CCCCCC
>Header_2
DDDDDD
>Header_3
EEEEEE
>Header_3
FFFFFF
ADD REPLY
0
Entering edit mode

On what basis are you splitting the sequences? A particular length? Or some feature of the sequence itself?

ADD REPLY
0
Entering edit mode

A particular length. I would like to split them so they can be up to 10k aa long.

ADD REPLY
3
Entering edit mode
2.4 years ago
JC 13k

Perl code:

#!/usr/bin/perl

use strict; 
use warnings;

my $word_size = 6;

$/="\n>"; # read sequence blocks
while (<>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    while  ($seq) {
        my $sub_seq = substr($seq, 0, $word_size);
        substr($seq, 0, $word_size) = '';
        print ">$id\n$sub_seq\n";
    }
}

Test:

$ perl split.pl < seqs.fa 
>Header_1
AAAAAA
>Header_1
BBBBBB
>Header_2
CCCCCC
>Header_2
DDDDDD
>Header_3
EEEEEE
>Header_3
FFFFFF
ADD COMMENT
0
Entering edit mode

Thank you very much JC! Just used that and it worked very well :)

ADD REPLY
2
Entering edit mode
2.4 years ago
Joe 21k

Here's some python code you can use:

from Bio import SeqIO
import argparse, sys


def main():
    try:
        parser = argparse.ArgumentParser(
            description="This script will split a provided FASTA file in to multiple sequences up to a defined length."
        )
        parser.add_argument(
            "-i",
            "--input",
            action="store",
            required=True,
            help="The FASTA to be processed.",
        )
        parser.add_argument(
            "-l",
            "--length",
            action="store",
            type=int,
            required=True,
            help="The length of substring to split by.",
        )

        args = parser.parse_args()

    except NameError:
        sys.stderr.write(
            "An exception occured with argument parsing. Check your provided options."
        )

    for record in SeqIO.parse(args.input, "fasta"):
        chunks = [record.seq[i:i + args.length] for i in range(0, len(str(record.seq)), args.length)]
        for index, chunk in enumerate(chunks):
            print(">" + record.description + "_" + str(index) + "\n" + chunk)


if __name__ == "__main__":
    main()

Use it as python scriptname.py -i fastafile.fa -l 6 (or whatever length you want)

On your test input:

>Header_1_0
AAAAAA
>Header_1_1
BBBBBB
>Header_2_0
CCCCCC
>Header_2_1
DDDDDD
>Header_3_0
EEEEEE
>Header_3_1
FFFFFF

Note that I've added an additional number to each header because @Lieven makes a good point that having multiple sequences with identical headers is likely to cause problems. You can amend the loop to remove this quite easily if you wish.

It should tolerate the 'leftovers' which are less than your specified threshold if you sequences aren't exact multiples of your chosen length, which I imagine they wont be.

ADD COMMENT

Login before adding your answer.

Traffic: 2476 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6