Question

Splitting a multifasta file into smaller multifasta files containing a few sequences each

1

7.9 years ago

kabir.deb ▴ 90

I am trying to split a large multifasta file into several smaller multifasta files. I have seen several examples of tools that split a multifasta file into single-sequence FASTA files, but none that put more than one sequence in each output file. Specifically, I'm trying to split a FASTA file containing 1300 sequences into 100 smaller files, each holding 13 consecutive sequences (same accession number). Hoping for good suggestions. Thanks in advance.

multifasta split small multifasta • 5.9k views

Updated 7.9 years ago by zjhzwang ▴ 180 • written 7.9 years ago by kabir.deb ▴ 90

0

This post may be helpful.

how to convert a long fasta-file into many separate single fasta sequences

Reply • 7.9 years ago by natasha.sernova ★ 4.0k

0

7.9 years ago

Vitis ★ 2.6k

Try the BioPerl or Biopython modules for handling FASTA; with them you can build tools for all kinds of FASTA file manipulation.

http://bioperl.org/howtos/SeqIO_HOWTO

http://biopython.org/wiki/Documentation
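
If you would rather not install either library, the same kind of tool can be sketched in plain Python. This is only a minimal illustration, not BioPerl or Biopython code: the function names `read_fasta` and `split_fasta` and the `<prefix>_<n>.fasta` naming are made up for this example, and it assumes a well-formed FASTA file where every record starts with a `>` header line.

```python
from itertools import islice
from pathlib import Path


def read_fasta(path):
    """Yield (header_line, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif header is not None and line:
                # Keep the original line wrapping; blank lines are skipped.
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, per_file, out_prefix):
    """Write consecutive groups of `per_file` records to numbered files.

    Hypothetical naming scheme: <out_prefix>_0.fasta, <out_prefix>_1.fasta, ...
    Returns the list of files written.
    """
    records = read_fasta(path)
    written = []
    index = 0
    while True:
        chunk = list(islice(records, per_file))
        if not chunk:
            break
        out_path = Path(f"{out_prefix}_{index}.fasta")
        with open(out_path, "w") as out:
            for header, lines in chunk:
                out.write("\n".join([header, *lines]) + "\n")
        written.append(out_path)
        index += 1
    return written
```

For the file in the question, `split_fasta("input.fasta", 13, "out")` would put records 1 to 13 in `out_0.fasta`, records 14 to 26 in `out_1.fasta`, and so on, regardless of what the headers say.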

0

7.9 years ago

Prasad ★ 1.6k

Try the UCSC utility faSplit, which has many options for splitting a FASTA file. For your case:

faSplit sequence input.fasta 100 out.fasta

where 100 is the number of files the input file will be split into, and out.fasta is used as the root of the output file names.

0

7.9 years ago

kabir.deb ▴ 90

Actually, I was using "pyfasta split -n 100 input.fasta".

The problem is that the 100 files generated this way have the sequences distributed randomly, which is not what I'm looking for. Yes, Brian, my large multifasta file has 100 different accessions, each with 13 consecutive sequences, for a total of 1300 sequences in input.fasta. That is, accession NC_12345 has 13 FASTA sequences in a row, then the next accession, NC_12346, has 13 FASTA sequences, and so on for 100 accessions. Now I just need to separate them out by accession, or equivalently put each consecutive block of 13 sequences in its own file.

0

OK, either of the two demuxbyname commands will work, then. Use the first if you have a list of accessions in a text file, or the second if you don't but the headers start with the accession.

Also, theoretically, you could do this:

partition.sh in=file.fasta out=temp_%.fasta ways=13
cat temp_*.fasta > catted.fasta
partition.sh in=catted.fasta out=final_%.fasta ways=100

That would give the first 13 records in final_0.fasta, the second 13 records in final_1.fasta, etc. through final_99.fasta. It would work regardless of the names, if you had exactly 1300 records.
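
To see why the two passes regroup consecutive records, here is a small Python simulation on record indices rather than real FASTA data. It rests on an assumption that is not stated in this thread: that partition.sh deals records out round-robin, so record i goes to output file i mod ways. The helper `round_robin` is made up for this sketch.

```python
def round_robin(items, ways):
    """Deal items into `ways` bins: item i goes to bin i % ways (assumed behaviour)."""
    bins = [[] for _ in range(ways)]
    for i, item in enumerate(items):
        bins[i % ways].append(item)
    return bins


records = list(range(1300))  # stand-ins for the 1300 FASTA records, in file order

# First pass: 13 temp files of 100 records each (temp_k holds k, k+13, k+26, ...).
temp = round_robin(records, 13)

# Concatenate the temp files (here in numeric order).
catted = [record for temp_file in temp for record in temp_file]

# Second pass: 100 final files of 13 records each.
final = round_robin(catted, 100)

# Each final file now holds one consecutive block of 13 original records.
assert final[0] == list(range(0, 13))
assert final[99] == list(range(1287, 1300))
```

One caveat: the shell expands temp_*.fasta in lexical order (temp_0, temp_1, temp_10, ...), not numeric order. Under the same round-robin assumption that only changes the order of the 13 records inside each final file, not which records end up together.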

Reply • 7.9 years ago by Brian Bushnell 20k

0

7.9 years ago

kabir.deb ▴ 90

Thanks, Brian, thank you very much. Using demuxbyname finally solved the problem.

0

7.9 years ago

mks002 ▴ 220

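Run the Perl script below, passing the number of sequences you want in each output file (13 in your case). Output files are named <file_name>.1, <file_name>.2, and so on, and each sequence is written on a single line.

perl fasta_per_line.pl fasta_file_name 13

#!/usr/bin/env perl
use strict;
use warnings;

my ($file, $f_size) = @ARGV;

if (!$file || !$f_size)
{
    die "\nUSAGE: perl fasta_per_line.pl <file_name> <Number>\n\n<file_name>- Multiple fasta file name\n<Number>- Number of fasta sequences to be put in each new file (Must be less than total number of sequences in input file)\n\n";
}

open(my $in, '<', $file) or die "Cannot open $file: $!\n";

my ($out, $file_count, $seq_count) = (undef, 0, 0);
my ($header, $seq);

# Write the record collected so far, opening a new output file
# before the first record and after every $f_size records.
sub write_record
{
    return unless defined $header;
    if (!$out || $seq_count == $f_size)
    {
        close($out) if $out;
        $file_count++;
        $seq_count = 0;
        my $name = "$file.$file_count";
        open($out, '>', $name) or die "Cannot write $name: $!\n";
    }
    print {$out} ">$header\n$seq\n";
    $seq_count++;
}

# Read record by record, so duplicate headers are kept as separate
# records and the whole file is never held in memory.
while (my $line = <$in>)
{
    chomp($line);
    if ($line =~ /^>(.*)/)
    {
        write_record();
        ($header, $seq) = ($1, '');
    }
    elsif (defined $header)
    {
        $seq .= $line;   # join wrapped sequence lines into one line
    }
}
write_record();          # flush the last record
close($out) if $out;
close($in);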

0

7.9 years ago

zjhzwang ▴ 180

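You can use the simple Unix command split. The usage is:

split -l N file [name]

N is the number of lines each output file will contain.
[name] is the prefix for the output file names; for example, with test, the output files will be testaa, testab, ...
Note that split counts lines, not sequences, so it only works for FASTA when every record spans the same number of lines.
I hope this is useful to you.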

score 4 · Accepted Answer · 2016-12-18

Your description is not completely clear in terms of what you mean by "same accession number"; does everything have the same accession, or are there 100 different accessions, each with 13 sequences? Regardless, you can use the BBMap package like this:

partition.sh in=file.fasta out=out_%.fasta ways=100

That will give you 100 fastas, each with an equal number of sequences. Alternatively, if you want to split by accession number:

demuxbyname.sh in=file.fasta out=out_%.fasta names=names.txt substringmode

If all of your accessions are listed in "names.txt", this will produce one file per accession, containing all the sequences labelled with that accession. Alternatively, if the accession is, for example, the first 12 characters of each sequence name, and you don't have a list of accessions, you could do this:

demuxbyname.sh in=file.fasta out=out_%.fasta prefixmode length=12
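
For readers who want to see what splitting by header prefix amounts to, here is a rough Python sketch of the idea. It is not how demuxbyname.sh is implemented; the function name `split_by_prefix` and the `out_<prefix>.fasta` naming are made up for illustration, and it assumes the prefix contains only characters that are safe in a file name.

```python
import os


def split_by_prefix(path, length, out_dir):
    """Write each record to out_<prefix>.fasta in `out_dir`, where <prefix>
    is the first `length` characters of the header (after the '>').

    Returns the sorted list of prefixes seen.
    """
    handles = {}
    current = None
    try:
        with open(path) as src:
            for line in src:
                if line.startswith(">"):
                    key = line[1:].rstrip("\n")[:length]
                    if key not in handles:
                        out_path = os.path.join(out_dir, f"out_{key}.fasta")
                        handles[key] = open(out_path, "w")
                    current = handles[key]
                if current is not None:
                    # Header and sequence lines go to the current record's file.
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

With accessions like NC_12345, calling `split_by_prefix("file.fasta", 8, ".")` would write one file per accession, and the sequences do not need to be adjacent in the input for this to work.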