Question

How to split a FASTA file at each '>' into files containing one sequence each, with every file named after its sequence ID?

5

6.9 years ago

SaltedPork ▴ 170

So far I have this:

awk '/^>/{s=++d".fasta"} {print > s}' file.fasta

This splits the file just as I want it, but it produces new files called 1.fasta, 2.fasta, 3.fasta and so on. Is there a method of splitting it that has the new file name as the ID of the sequence inside?

Or, failing that, is there a quick way of renaming FASTA files based on their sequence IDs?

fasta bash split • 7.7k views

ADD COMMENT • link updated 6.9 years ago by Pierre Lindenbaum 164k • written 6.9 years ago by SaltedPork ▴ 170

score 4 · Answer 1 · 2018-01-03

4

6.9 years ago

GenoMax 147k

Use the faSplit utility ( http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faSplit ) by Jim Kent from UCSC:

faSplit byname your_file.fa outRoot/

ADD COMMENT • link 6.9 years ago by GenoMax 147k

0

Thanks, this is good. I want to use this in combination with a find command, could you tell me why this isn't working?

for files in `find . -type f -name '*.consensus.fasta' -not -path "*/temp/*"`
do
    faSplit byname $files outRoot
done

ADD REPLY • link 6.9 years ago by SaltedPork ▴ 170

1

What is not working? Did you make a real directory to replace outRoot?

ADD REPLY • link 6.9 years ago by GenoMax 147k

0

Hi, yes, I did make a more suitable directory! I just didn't include it because the name is sensitive. I meant just looking at the loop: it's so simple, but it just doesn't work.

ADD REPLY • link 6.9 years ago by SaltedPork ▴ 170

1

You need to include the trailing / after the directory name for this to work correctly. Try this:

for files in `find . -type f -name '*.consensus.fasta' -not -path "*/temp/*"`
do
    faSplit byname $files outRoot/
done
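As a side note, the backtick `for` loop splits the output of find on whitespace, so it will misbehave if any path contains a space. A minimal sketch of a more defensive version follows. The function name split_consensus, the SPLITTER variable, and the directory arguments are illustrative assumptions, not part of faSplit; it also assumes no path contains a newline.

```shell
# Sketch: run "faSplit byname" on every matching file, keeping paths intact.
# split_consensus and SPLITTER are hypothetical names used for illustration.
split_consensus() {
    search_dir=$1                   # where to look for *.consensus.fasta
    out_dir=$2                      # output directory, must already exist
    splitter=${SPLITTER:-faSplit}   # override only to dry-run or test

    # "!" is the portable spelling of find's -not
    find "$search_dir" -type f -name '*.consensus.fasta' ! -path '*/temp/*' |
    while IFS= read -r fa; do
        # Always pass the output directory with exactly one trailing slash
        "$splitter" byname "$fa" "${out_dir%/}/"
    done
}
```

It would be called as, for example, `split_consensus . outRoot` after creating the output directory. Quoting "$fa" is what keeps file names with spaces in one piece.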

ADD REPLY • link 6.9 years ago by GenoMax 147k

score 1 · Answer 2 · 2018-01-03

1

6.9 years ago

h.mon 35k

The perl script found here does what you want:

When creating this multi-entry FASTA file, one should take care to make the first word after the > symbol a unique value, as it will be used as the file name for that sequence.

ADD COMMENT • link 6.9 years ago by h.mon 35k

score 1 · Answer 3 · 2018-01-03

1

6.9 years ago

Pierre Lindenbaum 164k

Create the file name with sprintf:

   echo -e ">hello\nAAA\n>world\nATGCA" |\
    awk '/^>/ {fout=sprintf("%s.fasta",substr($0,2));}{print >> fout;}'
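The one-liner above uses the whole header line as the file name, so a header such as ">seq1 some description" would produce a file name containing spaces. A possible variant, sketched here under the assumption that the sequence ID is the first whitespace-delimited word after ">", keeps only that word, replaces characters that are awkward in file names, and closes each file before opening the next. The function name split_by_id is hypothetical.

```shell
# Sketch: write each FASTA record to <ID>.fasta, where ID is the first word
# of the header line. split_by_id is a hypothetical name for illustration.
split_by_id() {
    awk '
        /^>/ {
            if (out != "") close(out)          # keep the number of open files small
            id = substr($1, 2)                 # first word of the header, minus ">"
            gsub(/[^A-Za-z0-9._-]/, "_", id)   # replace characters unsafe in file names
            out = id ".fasta"
        }
        out != "" { print >> out }             # lines before the first header are skipped
    ' "$@"
}
```

Usage would be `split_by_id file.fasta`, or piping FASTA text into `split_by_id`. Because it appends with `>>`, as the original does, running it twice in the same directory duplicates the records, so old output files should be removed first.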

ADD COMMENT • link 6.9 years ago by Pierre Lindenbaum 164k