how to convert a long fasta-file into many separate single fasta sequences

Entering edit mode

11.2 years ago

natasha.sernova ★ 4.0k

Hello, everybody!

I am looking for the procedure reversed to cat.

I would like to separate a long fasta-file into many single fasta sequences, so each sequence from the large fa-file will be situated in a new small file.fa each. I usually used something like the following, but it not as good as it looks like... Where is the mistake?

Many thanks for your help!

sub get_fasta {
  my $inputName = $_[0];
  my $fh;
# new input record separator for fasta-files
  $/ = "\n>";
open ($fh, $inputName) or die "cannot open file: $!\n";
my @entries = <$fh>;
 # remove last character '>' except for the last entry
  chop(@entries[0..$#entries - 1]);
# add '>' except for the 1st entry
  foreach (@entries[1..$#entries]) {$_ = '>'.$_;}
# print"@entries";
close($fh);
  $/="\n";
  return \@entries;

sequence fasta • 41k views

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

Since you were trying for a Perl solution, this answer will help: Splitting A Fasta File

Here are some more solutions for you in other posts:

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by SES 8.6k

Entering edit mode

Thank you, I've missed really a lot.

Natasha

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

11.2 years ago

dariober 15k

Another option in bash... Possibly more concise and readable...

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line#>}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa

You input file is myseq.fa. Individual files will have the name of the corresponding sequence, without leading >. If your sequence names contain funny characters (non alphanumeric) it will probably break but could be fixed depending exactly how you want to call the output files.

ADD COMMENT • link updated 4.9 years ago by Ram 45k • written 11.2 years ago by dariober 15k

Entering edit mode

Hey all, I just tried to use this script and got an error for ambiguous redirect. Any ideas? Thanks in advance.

ADD REPLY • link 6.2 years ago by mdem7705 ▴ 10

Entering edit mode

Hi, placing quotation marks around the $outfile (ie. "$outfile") should fix it. Hope you have figured it out earlier

ADD REPLY • link 5.3 years ago by ct213 • 0

Entering edit mode

nice use of bash substitutions.

ADD REPLY • link 11.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

Perfect! Now it has given even the sub-sequence names from the large fasta as

each small fasta-file name! Really great!

Thousand thanks!

Natasha

ADD REPLY • link 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

one of the best one-liners to split a fasta file.

ADD REPLY • link 7.6 years ago by Prakki Rama ★ 2.7k

Entering edit mode

How could this be modified to: 1. Instead name the output file(s) the same as the input file and perhaps appended with the corresponding sequence name? 2. Run as a loop so that this could be used to split hundreds of different input files (each with different names corresponding to gene names).

for file in $(ls *.fasta)
do
while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=$file.${line#>}.fasta
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done
done

I can't seem to figure it out.

ADD REPLY • link 7.4 years ago by nmkn ▴ 10

Entering edit mode

11.2 years ago

Pierre Lindenbaum 166k

linearize your fasta sequence (two lines per sequence) and use split -l 2

cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2 - seq_

ADD COMMENT • link updated 4.9 years ago by Ram 45k • written 11.2 years ago by Pierre Lindenbaum 166k

Entering edit mode

Dear Pierre,

It's a very nice idea - to kill "cat " with the same weapon! I will definitely try this!

Thank you very much! It did work as well.

Natasha

ADD REPLY • link 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

It works very nice! plus i found that you can add the option --additional-suffix to add the extension you prefer.

cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2 --additional-suffix=.fasta - seq_

ADD REPLY • link updated 4.9 years ago by Ram 45k • written 7.6 years ago by jcsanchez2 • 0

Entering edit mode

This one could spilt a big file into each, but I would like use the header name as each file name, is that possible?
For example: here is a big file:
>OTU1
ATCGATCGATCGTCGATCG
...
>OTU1000
TCGATCGTCGATCGTCGATCGTCGATCG
The output should be like this:
OTU1.fasta
...
OTU1000.fasta

ADD REPLY • link 4.9 years ago by yaqinguo629 • 0

Entering edit mode

Here is the answer that I got, I hope to help novices:
1:first step to spilt into each file
cat *.fasta |\
awk '/^>/ {if(N>0) printf("\n"); printf("%s\n",$0);++N;next;} { printf("%s",$0);} END {printf("\n");}' |\
split -l 2  - OTU
2:second step to add extension
for file in OTU*
do
 mv "$file" "$file.fasta"
done
3:third step to change the file name to the header name
for i in *.fasta; do 
mv $i $(head -1 $i | cut -f1 -d ' ' | tr -d '>' ).fasta
done
By doing these, I got the output that what I want. Thanks for people who gave me an idea.

ADD REPLY • link 4.9 years ago by yaqinguo629 • 0

Entering edit mode

11.2 years ago

Devon Ryan 105k

Reading the whole genome into memory seems like a rather wasteful way to go about things. Just process things like by line (note, this is a random python script I have sitting around, you'd need to edit it):

#!/usr/bin/env python
f=open("genome.fa","r");
opened = False
for line in f :
    if(line[0] == ">") :
        if(opened) :
            of.close()
        opened = True
        of=open("%s.fa" % (line[1:].rstrip()), "w")
        print(line[1:].rstrip())
    of.write(line)
of.close()

You'd obviously need to change the filename. This would also need to be changed if you have descriptors after a contig/chromosome name, but it's enough to give you the idea.

ADD COMMENT • link updated 4.9 years ago by Ram 45k • written 11.2 years ago by Devon Ryan 105k

Entering edit mode

Thanks for the script Devon! It works well on linux and unix based systems. I edited it with a bit of annotation.

#  GENERAL USAGE: python ./scripts/Divide_Reference.fa_by_Contig.py ./data/reference/reference.fa

import os
from os import path
import sys

infile=open(sys.argv[1])

os.system("cd /mnt/scratch/steepale/birdman/MDV_project/DNAseq/GATK_Variant_Analysis_Germline_DNA") # Add your project directory in here

path = sys.argv[1].split('/')[0] + "/" + sys.argv[1].split('/')[1] + "/" + sys.argv[1].split('/')[2]

ref_file = path + "/" + sys.argv[1].split('/')[3]
print('\n' + "reference file: " + ref_file + '\n')
num_contigs = 'grep "^>" %s  | wc -l' % (ref_file)
print("The number of contigs in reference fasta:")
os.system(num_contigs)

opened = False # Assume outfile is not open

for line_ref in infile:
    if line_ref[0] == ">": # If line begins with ">"
        if(opened): 
            outfile.close() # Will close the outfile if it is open (see below and follow loop)
        opened = True # Set opened to True to represent an opened outfile
        contig_name = line_ref[1:].rstrip() #Extract contig name: remove ">", extract contig string, remove any spaces or new lins following file
        print("contig: " + contig_name)
        print("Name of outfile: " + path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa") 
        outfile=open(path + "/" + str(sys.argv[1].split('/')[3].split(".")[0]) + "_" + str(contig_name) + ".fa", 'w')
    outfile.write(line_ref) # Write the line to the file. If line started with ">" a new output file was created and ready to be written to.
                            # Subsequent lines (AKA the sequence) will also be written as the script loops through lines.
                            # Not until we reach a new line beginning with ">" is the outfile closed and new outfile generated with new name.
outfile.close()
print "Fin"

ADD REPLY • link 9.3 years ago by alec.steep ▴ 10

Entering edit mode

Looks good, if you really want to go whole hog then add argparse in there :)

ADD REPLY • link 9.3 years ago by Devon Ryan 105k

Entering edit mode

Hello!

Unfortunately, I'm very bad in Python. I will try to edit your script, but I am not quite sure of the result.

Natasha

ADD REPLY • link updated 4.9 years ago by Ram 45k • written 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

Dear Devon,

I've run your script, it gave me just names of the sequences

in the fasta-file, no sequences at all.

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

Perhaps you have Windows line endings, in which case open with "rU" rather than just "r". Other than that, what you're describing wouldn't be a possible outcome of what I posted unless the original file also contains only contig/chromosome names.

ADD REPLY • link 11.2 years ago by Devon Ryan 105k

Entering edit mode

Dear Devon, I'm really sorry!

It did work in the first time, and I now have a nice list of sepated fasta-files!

Many-many thanks!

Natasha

ADD REPLY • link 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

11.2 years ago

Matt Shirley 10k

Easy Python script using a package I wrote:

	from pyfaidx import Fasta, wrap_sequence
	fa = Fasta('multi.fasta')
	for seq in fa:
	with open('{}.fa'.format(seq.name), 'w') as out:
	out.write('>{}\n'.format(seq.name))
	for line in wrap_sequence(70, str(seq)):
	out.write(line)

view raw split_fasta.py hosted with ❤ by GitHub

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 11.2 years ago by Matt Shirley 10k

Entering edit mode

Great, thank you!

Natasha

ADD REPLY • link 11.2 years ago by natasha.sernova ★ 4.0k

Entering edit mode

11.1 years ago

hpmcwill ★ 1.2k

A couple of options that come to mind:

EMBOSS seqretsplit
csplit, for example: csplit -f outfile. input.fa '%^>%' '/^>/' '{*}'

ADD COMMENT • link updated 5.9 years ago by Ram 45k • written 11.1 years ago by hpmcwill ★ 1.2k

Entering edit mode

8.3 years ago

iuli.151293 ▴ 10

If you want to separate a multi-fasta file, you can use the above script but you have to delete the fasta header. if you want a faster way, you can use the following script. It allows you to separate each sequence in an individual fasta file and the name of that file will be the first 11 characters after the ">" without deleting the header.

while read line
do
    if [[ ${line:0:1} == '>' ]]
    then
        outfile=${line:1:11}.fa
        echo $line > $outfile
    else
        echo $line >> $outfile
    fi
done < myseq.fa

ADD COMMENT • link 8.3 years ago by iuli.151293 ▴ 10

Entering edit mode

8.5 years ago

BioApps ▴ 800

This tool I built will split a multiFasta, SFF, FastQ file in individual OR smaller-block fasta files. The tool works also on Linux via WinE without any drawbacks. It can also convert between different formats and do sequence processing as FastQC.

split mutiple fasta to single/indiviual fasta

ADD COMMENT • link 8.5 years ago by BioApps ▴ 800