Question

How To Split One Big Sequence File Into Multiple Files With Less Than 1000 Sequences In A Single File

9

Entering edit mode

13.6 years ago

Hamilton ▴ 290

Hi,

I am too native here, sorry first for trivial question,

I am trying to split one big sequence FASTA file into multiple files with less than 1000 sequences in a single file.

how can i generate them?

Any inputs will be very appreciated,

next-gen sequencing fasta rna r perl • 98k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 13.6 years ago by Hamilton ▴ 290

Ram · Answer 1 · 2011-10-19

28

Entering edit mode

13.6 years ago

Pierre Lindenbaum 166k

using awk:

awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1000==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < sequences.fa

ADD COMMENT • link updated 5.2 years ago by Ram 45k • written 13.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Works fine, but if input file is large enough it prints error: awk: cannot open "myseq25525.fa" for output (Too many open files). But, as in my case, one can split larger than 25000 sequence file to few smaller.

ADD REPLY • link 13.4 years ago by boczniak767 ▴ 870

0

Entering edit mode

It works good, but if I have to split my file (60.000 sequences) to files with 25 sequences I get an error awk: cannot open "myseq25525.fa" for output (Too many open files). But I can just split my large file to smaller parts.

ADD REPLY • link 13.4 years ago by boczniak767 ▴ 870

0

Entering edit mode

Hi,

How can I change the output format from myseq0000.fa, myseq1000.fa..... to include the name of input fasta sequence.

E.g.,

Input fasta file name: ABC.fa

Output file names: ABC_myseq0000.fa

Your inputs will be much appreciated.

Vijay

ADD REPLY • link 10.1 years ago by vijay.bhaaskarla • 0

0

Entering edit mode

Thank you, it is very easy way of splitting big fasta file. :)

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.8 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

hello, is there any awk command for extracting upstream and downstream region fron the seq?

ADD REPLY • link 9.8 years ago by harpreetmanku04 ▴ 10

0

Entering edit mode

I am sorry to ask a certainly silly question but I am trying to do this with a fasta file containing 300 000 contigs to separate in 6 files of less than 50 000 sequences. I have made a test.fa file containing 15 sequences to try awk described above did not work in R (may be obvious)

I have just started R and would like a prog or extension that I could use in R Anyone could redirect me pleease

ADD REPLY • link 8.9 years ago by alexandra.lanot • 0

1

Entering edit mode

Awk is one of bash commands, not R.

ADD REPLY • link 8.9 years ago by boczniak767 ▴ 870

0

Entering edit mode

I tried this for my sequence file , but not working

awk: cmd. line:1: (FILENAME=- FNR=1) fatal: expression for `>>' redirection has null string value

ADD REPLY • link 8.3 years ago by tcf.hcdg ▴ 70

1

Entering edit mode

For me it seems that you haven't provide the output file name (required after the>>).

ADD REPLY • link 8.3 years ago by boczniak767 ▴ 870

0

Entering edit mode

got it. my mistake. Now worked

ADD REPLY • link 8.3 years ago by tcf.hcdg ▴ 70

Ram · Answer 2 · 2011-10-19

15

Entering edit mode

13.6 years ago

Giovanni M Dall'Olio 28k

I think you can do it with PyFasta:

split a fasta file into 6 new files of relatively even size:
 $ pyfasta split -n 6 original.fasta
  

If you know the number of sequences in the file (just grep '>' original_fasta | wc -l), you can easily calculate how to split it in <1000 files.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 13.6 years ago by Giovanni M Dall'Olio 28k

3

Entering edit mode

You can also do this using pyfaidx, with the command:

faidx --split-files original.fasta

Each sequence will be placed in its own file with a filename derived from the sequence identifier.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Matt Shirley 10k

Ram · Answer 3 · 2011-10-19

I would like to recommend the faSplit utility from the UCSC Genome Browser suite. However, I didn't found a downloaded binary from their site, but you can compile it yourself.

$ faSplit 
faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base.  
Files broken by size will be broken every count bases.

Examples:
   faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries

   faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files

   faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks

   faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.

   faSplit byname scaffolds.fa outRoot/ 
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

   faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each, 
at gap boundaries if possible.  If the sequence ends in N's, the last
piece, if larger than 20000, will be all one piece.

Options:
    -verbose=2 - Write names of each file created (=3 more details)
    -maxN=N - Suppress pieces with more than maxN n's.  Only used with size.
              default is size-1 (only suppresses pieces that are all N).
    -oneFile - Put output in one file. Only used with size
    -extra=N - Add N extra bytes at the end to form overlapping pieces.  Only used with size.
    -out=outFile Get masking from outfile.  Only used with size.
    -lift=file.lft Put info on how to reconstruct sequence from
                   pieces in file.lft.  Only used with size and gap.
    -minGapSize=X Consider a block of Ns to be a gap if block size >= X.
                  Default value 1000.  Only used with gap.
    -noGapDrops - include all N's when splitting by gap.
    -outDirDepth=N Create N levels of output directory under current dir.
                   This helps prevent NFS problems with a large number of
                   file in a directory.  Using -outDirDepth=3 would
                   produce ./1/2/3/outRoot123.fa.
    -prefixLength=N - used with byname option. create a separate output
                   file for each group of sequences names with same prefix
                   of length N.

Ram · Answer 4 · 2011-10-19

6

Entering edit mode

13.6 years ago

toni ★ 2.2k

Supposing that the sequence information is written on a single line (one header + one single sequence line), you could simply have a look at split function under Linux.

split -l 2000 input.fasta

man split for more customization of output files generated.

Tony

ADD COMMENT • link updated 5.2 years ago by Ram 45k • written 13.6 years ago by toni ★ 2.2k

7

Entering edit mode

Suppose it isn't :)

ADD REPLY • link 10.8 years ago by Neilfws 49k

1

Entering edit mode

Multi-line to single line fasta

ADD REPLY • link 8.7 years ago by st.ph.n ★ 2.7k

score 3 · Answer 5 · 2018-06-18

A shorter and more user-friendly awk version without a limit on the number of input fasta records:

awk -v size=1000 -v pre=prefix -v pad=5 '/^>/{n++;if(n%size==1){close(f);f=sprintf("%s.%0"pad"d",pre,n)}}{print>>f}' input.fasta

Command arguments are: size -- chunk size, pre -- output file prefix, pad -- padding width (the width of the numeric suffix).

Ungolfed version for clarity:

awk -v size=1000 -v pre=prefix -v pad=5 '
   /^>/ { n++; if (n % size == 1) { close(fname); fname = sprintf("%s.%0" pad "d", pre, n) } }
   { print >> fname }
' input.fasta

Ram · Answer 6 · 2014-09-14

/**
 * This tool aims to chop the file in various parts based on the number of sequences required in one file.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.lang3.StringUtils;

//import java.util.List;

/**
 * @author Arpit
 * 
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        byte[] allBytes = null;
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        System.out.println(outFileName);
        try {
            allBytes = Files.readAllBytes(Paths.get(fileName));
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        String allLines = new String(allBytes, StandardCharsets.UTF_8);
        // Using a clever cheat with help from stackoverflow
        String cheatString = allLines.replace(">", "~>");
        cheatString = cheatString.replace("\\s+", "");
        String[] splitLines = StringUtils.split(cheatString, "~");
        int startIndex = 0;
        int stopIndex = 0;

        FileWriter fw = null;
        for (int j = 0; j < numOfFiles; j++) {

            fw = new FileWriter(outFileName.concat("_")
                    .concat(Integer.toString(j)).concat(".fasta"));
            if (j == (numOfFiles - 1)) {
                stopIndex = splitLines.length;
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            for (int i = startIndex; i < stopIndex; i++) {
                fw.write(splitLines[i]);
            }
            if (j < (numOfFiles - 1)) {
                startIndex = stopIndex;
            }
            fw.close();
        }

    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\filename.fasta",5);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}