How can I split a multi-FASTA file into separate files of roughly similar size, as specified by the user? Is there a tool for that? The tool should not split an individual FASTA entry across files.
Gvj
I suggest the fastasplitn command. It works like a charm for me. pyfasta has similar functionality:
# split a fasta file into 6 new files of relatively even size:
pyfasta split -n 6 original.fasta
The most efficient one I know of is a GenomeThreader tool (EDIT: GenomeTools, actually):
If you want to constrain the number of output files (here 60):
gt splitfasta -numfiles 60 seqs.fasta
If you want to constrain the size in MB (here 20) of each output file:
gt splitfasta -targetsize 20 seqs.fasta
There is an alternative to Brent's pyfasta, using Kent's source tools (UCSC): faSplit.
faSplit - Split an fa file into several files.
usage:
faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.
Files split by sequence will be broken at the nearest fa record boundary.
Files split by base will be broken at any base.
Files broken by size will be broken every count bases.
Examples:
faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa, est002.fa, ... est100.fa)
Files will only be broken at fa record boundaries
faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files
faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks
faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.
faSplit byname scaffolds.fa outRoot/
This breaks up scaffolds.fa using sequence names as file names.
Use the terminating / on the outRoot to get it to work correctly.
faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each,
at gap boundaries if possible. If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.
If you are on a Linux machine, the pre-compiled binaries can be found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/
I actually wrote one of these yesterday. Here's the code. It's rough, but hopefully understandable - it uses BioPerl. The idea is that you specify how many sequences you want per file, e.g. ./splitfasta.pl allseqs.fa splitseqs 100
- it'll create splitseqs.1, splitseqs.2, splitseqs.3, etc., each with 100 sequences in it.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# usage: ./splitfasta.pl <input.fa> <output_prefix> <seqs_per_file>
my $from     = shift;
my $toprefix = shift;
my $seqs     = shift;

my $in = Bio::SeqIO->new(-file => $from, -format => 'fasta');

my $count  = 0;
my $fcount = 1;
my $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format => 'fasta');

while (my $seq = $in->next_seq) {
    # open the next output file once the current one holds $seqs sequences
    if ($count > 0 && $count % $seqs == 0) {
        $fcount++;
        $out = Bio::SeqIO->new(-file => ">$toprefix.$fcount", -format => 'fasta');
    }
    $out->write_seq($seq);
    $count++;
}
You can also use a dodgy little one-liner in a pinch, e.g.:
csplit -z myfile.fas '/>/' '{*}'
http://41j.com/blog/2011/01/split-fasta-file-into-files-with-one-contig-per-file/
You could just use csplit on Linux. You give it a simple regex marking what should start each new file ('>') and a repeat count controlling how many pieces are produced; see the sketch after the link below.
Check here
http://python.genedrift.org/2007/10/10/alternative-methods-to-split-a-fasta-file/
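For reference, a minimal sketch of that approach, assuming GNU csplit (the part_ prefix is just an arbitrary name chosen for illustration): it splits at every header line, writes one record per output file, and skips empty pieces:
csplit -z -f part_ myfile.fas '/^>/' '{*}'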
fastasplit in the exonerate package's utilities (bottom of the page) does exactly this.
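A typical invocation looks roughly like the line below; this is a sketch from memory, so treat the option names (--fasta, --output, --chunk) as assumptions and confirm them against fastasplit --help on your install:
fastasplit --fasta seqs.fasta --output outdir --chunk 10   # option names assumed; verify with --help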
I use fasta_splitter for this purpose. It's good too!
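For example, something along these lines should split the input into a fixed number of parts; the option name --n-parts is an assumption here, so check it against the script's --help:
fasta-splitter.pl --n-parts 10 input.fasta   # --n-parts assumed; verify with --help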
Another option, from the BBMap package:
partition.sh in=file.fasta out=part%.fasta ways=5
This is multithreaded and very fast. It works on fastq also.
#!/usr/bin/perl -w
my $usage = <<EOF;
This splits a fasta file into several files in a zigzag fashion.
usage: perl $0 x.fa num
Warning: it is best to mkdir a new directory and run this inside it.
Du Kang 2017-1-17
EOF

# read all sequences into %seq, keyed by the first word of each header
open SEQ, $ARGV[0] or die $usage;
while (<SEQ>) {
    chomp;
    if (/>/) {
        s/>//;
        @_ = split;
        $name = $_[0];
    } else {
        $seq{$name} .= $_;
    }
}

# rank the sequences by length
foreach $name (keys %seq) {
    $length{$name} = length $seq{$name};
}
@id = sort { $length{$a} <=> $length{$b} } keys %length;

# dispatch the ranked sequences to files hehe.1 .. hehe.num, walking the
# file index back and forth (1,2,...,num,num,...,2,1,1,2,...)
$filenum = $ARGV[1] or die $usage;
$n    = 0;
$flag = 0;
foreach $name (@id) {
    if ($n >= $flag) {
        if ($n == $filenum) {
            $flag = $flag + 2;
        } else {
            $flag = $n;
            $n++;
        }
    } else {
        if ($n == 1) {
            $flag = $flag - 2;
        } else {
            $flag = $n;
            $n--;
        }
    }
    open OUT, ">>hehe.$n" or die $!;
    print OUT ">$name\n$seq{$name}\n";
    close OUT;
}
It ranks the sequences by length, then dispatches them to the output files in a zigzag fashion so that the resulting files end up almost even in total size.
/**
 * This tool chops a fasta file into several parts based on the requested
 * number of output files; individual fasta records are kept intact.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.lang3.StringUtils;

/**
 * @author Arpit
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");

        // Read the whole file, then split it into records using a clever cheat
        // (with help from stackoverflow): prefix every '>' with '~' and split
        // on '~', so each element is one complete fasta entry.
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));
        String allLines = new String(allBytes, StandardCharsets.UTF_8);
        String cheatString = allLines.replace(">", "~>");
        String[] splitLines = StringUtils.split(cheatString, "~");

        int recordsPerFile = splitLines.length / numOfFiles;
        int startIndex = 0;
        int stopIndex = 0;

        for (int j = 0; j < numOfFiles; j++) {
            // the last file takes any leftover records
            stopIndex = (j == numOfFiles - 1) ? splitLines.length
                                              : stopIndex + recordsPerFile;
            try (FileWriter fw = new FileWriter(
                    outFileName + "_" + j + ".fasta")) {
                for (int i = startIndex; i < stopIndex; i++) {
                    fw.write(splitLines[i]);
                }
            }
            startIndex = stopIndex;
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta", 5);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The gt program is actually part of the GenomeTools library. The GenomeThreader tool (gth on the command line) is developed by the same group but is used for spliced alignment.
You are right ;-) I'll edit the answer now.
6 years later: Thanks Manu, that one is fast!