How to split a multi-FASTA file into single FASTA files named by header
3.7 years ago
Kumar ▴ 120

I have a multi-FASTA file named genome.fasta, as follows:

genome.fasta
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG
>1582.LC madi kg 5/58/8
GATGAT

I need to split genome.fasta into single-sequence FASTA files, each named after the first word of its FASTA header. The expected output is as follows:

LI5896452.1.fasta
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG

1582.LC.fasta
>1582.LC madi kg 5/58/8
GATGAT

I found many scripts online, but they all split the file and name the pieces in their own way; I could not find any script that uses the header as the file name. Please help me do this.

genome perl python3 bash python • 9.1k views

Linearize your fasta file using code here


Then use the solutions in: Split Fasta file and rename output files with contig names
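
The linearization step mentioned above can be sketched with awk. This is only a minimal sketch, not the code behind the original link: linearize_fasta is a hypothetical helper name, and it assumes a plain-text FASTA with Unix line endings. It joins wrapped sequence lines so that every record becomes exactly two lines (header, sequence).

```shell
# Hypothetical helper: linearize_fasta INPUT
# Prints INPUT to stdout with each record reduced to two lines.
linearize_fasta() {
    awk '
        /^>/ {
            if (have) print seq    # flush the sequence of the previous record
            print                  # emit the header unchanged
            seq = ""
            have = 1
            next
        }
        { seq = seq $0 }           # append wrapped sequence lines
        END { if (have) print seq }
    ' "$1"
}
```

Usage would be something like: linearize_fasta genome.fasta > genome.flat.fasta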


With awk, on a flattened FASTA (one sequence line per record):

$ cat test.fa
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG
>1582.LC madi kg 5/58/8
GATGAT

$ awk -v OFS="\n" '/^>/ {getline seq; print $0,seq > (substr($1,2) ".fa")}' test.fa

$ tree .
.
├── 1582.LC.fa
├── LI5896452.1.fa
└── test.fa

0 directories, 3 files

$ cat 1582.LC.fa
>1582.LC madi kg 5/58/8
GATGAT

$ cat LI5896452.1.fa
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG

This only works when each sequence is on a single line; for wrapped (multi-line) sequences it keeps only the first sequence line.

3.7 years ago
GenoMax 147k

Use the faSplit utility from Jim Kent (LINK for Linux version). Add execute permission after you download it (chmod a+x faSplit).

$ faSplit byname scaffolds.fa outRoot/ 
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.
3.7 years ago

Another solution using AWK:

awk '/^>/ {out = substr($1, 2) ".fasta"; print > out} !/^>/ {print >> out}' genome.fasta

This also works well for multi-line FASTA files.
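
One caveat for inputs with many records: awk keeps every output file open until it exits, so some implementations (mawk or BSD awk, for example) can run out of file handles. Below is a minimal sketch of the same idea that closes the previous file at each header. split_fasta_by_id is a hypothetical helper name, and replacing "/" in IDs with "_" is an assumption added here, not part of the answer above.

```shell
# Hypothetical helper: split_fasta_by_id INPUT OUTDIR
# Writes each record of INPUT to OUTDIR/<first word of header>.fasta.
split_fasta_by_id() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)   # release the previous file handle
            id = substr($1, 2)          # first word of the header, minus ">"
            gsub(/\//, "_", id)         # assumption: "/" is not valid in a file name
            out = dir "/" id ".fasta"
            print > out                 # the first write truncates the file
            next
        }
        out != "" { print > out }       # same open stream, so lines are appended
    ' "$1"
}
```

Usage would be something like: split_fasta_by_id genome.fasta split_out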

3.7 years ago

Try seqkit split2

out=result

seqkit split2 --by-size 1 genomes.fasta -O "$out"

find "$out" -name "*.fasta" \
    | while IFS= read -r f; do
        mv "$f" "$out/$(seqkit seq --name --only-id "$f").fasta"
    done

Result

$ tree
.
├── genomes.fasta
└── result
    ├── 1582.LC.fasta
    └── LI5896452.1.fasta

Thank you, shenwei356. However, your script fails on a large dataset with the following error:

[INFO] split seqs from genomic.fasta
[INFO] split into 1 seqs per file
[INFO] write 1 sequences to file: result/genomic.part_001.fasta
[INFO] write 1 sequences to file: result/genomic.part_002.fasta
-
-
-
-
-
:line 5:    /bin/ls: Argument list too long

Use find instead of ls.

find "$out" -name "*.fasta" \
    | while IFS= read -r f; do
        mv "$f" "$out/$(seqkit seq --name --only-id "$f").fasta"
    done

I'm not good at find -exec. You can also use find/fd + parallel
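
For reference, a find -exec version of the rename loop could look roughly like this. It is a sketch under assumptions: rename_by_first_id is a hypothetical helper name, the ID is pulled from the first header line with awk rather than with seqkit, and IDs are assumed not to contain "/". Because find passes file names in batches, the "Argument list too long" error does not come up.

```shell
# Hypothetical helper: rename_by_first_id DIR
# Renames every *.fasta under DIR to <first word of its first header>.fasta.
rename_by_first_id() {
    find "$1" -type f -name '*.fasta' -exec sh -c '
        for f do
            # first word of the first line, without the leading ">"
            id=$(awk "NR == 1 { print substr(\$1, 2); exit }" "$f")
            [ -n "$id" ] || continue
            target=$(dirname "$f")/$id.fasta
            # skip files that already carry the right name
            [ "$f" = "$target" ] || mv -- "$f" "$target"
        done
    ' sh {} +
}
```

Usage would be something like: rename_by_first_id result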


This means that you created too many files when splitting the original FASTA file.

How many entries do you have in your original file? With anything above 50-60k entries, you will need to subdivide the files into subfolders for the output to remain workable.
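
The subfolder idea can be sketched as follows. The helper name split_fasta_bucketed, the batch_NNNN folder naming, and the default of 1000 files per folder are all assumptions made for illustration; adjust them to whatever your file system handles comfortably.

```shell
# Hypothetical helper: split_fasta_bucketed INPUT OUTDIR [PER_DIR]
# Puts at most PER_DIR single-record files into each OUTDIR/batch_NNNN folder.
split_fasta_bucketed() (
    input=$1
    outdir=$2
    per_dir=${3:-1000}                       # assumed default

    n=$(grep -c '^>' "$input")               # number of records
    batches=$(( (n + per_dir - 1) / per_dir ))

    # Create the batch folders up front so awk only has to write files.
    i=1
    while [ "$i" -le "$batches" ]; do
        mkdir -p "$outdir/$(printf 'batch_%04d' "$i")" || exit 1
        i=$((i + 1))
    done

    awk -v dir="$outdir" -v per="$per_dir" '
        /^>/ {
            if (out != "") close(out)        # keep only one file open at a time
            id = substr($1, 2)
            gsub(/\//, "_", id)              # assumption: "/" is not valid in a file name
            out = sprintf("%s/batch_%04d/%s.fasta", dir, int(count / per) + 1, id)
            count++
            print > out
            next
        }
        out != "" { print > out }
    ' "$input"
)
```

Usage would be something like: split_fasta_bucketed genomic.fasta result 5000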


The following command will give you FASTA files named by sequence ID:

seqkit split -i [input_fasta] --out-dir [output_directory]

The files will be inside the output directory, though you will have to rename them because they come with an automatic prefix.


There's a flag to remove the prefix.

--by-id-prefix ""
3.7 years ago

Quick perl one-liner:

perl -ne 'if (/^>(\S+)/) { close OUT; open OUT, ">$1.fasta" } print OUT' genome.fasta
3.7 years ago
Juke34 8.9k

On this subject, here is a review of ways to split a FASTA file: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/split_fasta.md The Bash and faSplit approaches name each output file by sequence name; for the other tools this is not mentioned, but that does not mean they cannot do it.
