Question

Split ordered fasta file by alphabetical group using awk

1

Entering edit mode

7.1 years ago

Josh Herr 5.8k

I know there are a ton of awk one-liners on here for splitting a fasta file, but here's one I have not been able to get to work or find an answer for.

A simple tool would be helpful, but I tried to use pyfasta and seqtk to no avail.

Please excuse me if someone has answered this one before, but I have googled and biostared for a while with no awk solution in sight.

A collaborator passed me a fasta file with the output from OrthoMCL - clustered genes in a single fasta file. A clustered group of genes in the file is listed alphabetically by the organism it was found in. Genes for some organisms are not present, so I can't split on the number of total organisms represented across all the fasta headers.

Any advice how to split a fasta file when the first two characters of the header is >A so that I have many fasta files where each clustered gene has it's own fasta file? There are multiple organisms with A as the first letter so I don't want to split just on A - I want to split before the first A in a series only.

awk fasta one-liner • 2.4k views

ADD COMMENT • link updated 7.1 years ago by cpad0112 21k • written 7.1 years ago by Josh Herr 5.8k

1

Entering edit mode

@OP: Good description of data. An example fasta and expected output would be helpful.

ADD REPLY • link 7.1 years ago by cpad0112 21k

0

Entering edit mode

Absolutely...

The original file is over 3 million sequences ordered like this:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA
>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA
>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA
>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and I am looking to parse the file like this:

file 1:

>A_sp 8272638
AGGAGAGAGAAGCTGGGACTACTA
>A_sp_2 3882626
AGGAGAGAGAAGGGGGGACTACTA
>A_sp_3 9748636
AGGAGAGAGAAGCTACTACTGGGA
>B_sp 29384723
AGGAGAGAGAAGCTACTGGGGCTA
>M_sp 2863762
AGGAGAGAGAAGCTACTACTACTA
>T_sp 28736
AGGAGAGAGAAGCTACTACGGCTA
>X_sp 28736
AGGAGAGAGAAGCTACTACTACTA

file 2:

>A_sp 38492638
CCTCTCTCCCTCTTCCGGGCTATA
>A_sp_3 23534532
CCTCTCTCCCTCCTCCGGGCTATA
>C_sp 3455354
CCTCTCTCCCTCAACCGGGCTATA
>F_sp 456434
CCTCTCTCCCTCTTCCGTGCTATA
>G_sp 6345634
CCTCTCTCCCTCTTCCGTGCTATA
>M_sp 2863762
CCTCTCTCCCTCCCTTGTGCTATA
>T_sp 28736
CCTCTCTCCAACCCCCGTGCTATA
>X_sp 28736
CCTCTCTCCCTAACCCGGGCTATA

file 3:

>A_sp 543646
GGGAGAGAGAAGCTACTACTACTA
>A_sp_4 624634
GGGAGAGAGAAGCTACTAAAACTA
>A_sp_7 4265436
GGGAGAGAGAAGCTAAAAAAACTA
>B_sp 29384723
GGGAGAGAGAAAAAAAAAAAACTA
>O_sp 235434654
GGGAGAGAGAAAAAAAAAAAACTA
>S_sp 456345
GGGAGAGAGAAGCTACTACTACTA
>Z_sp 45645
GGGAGAGAGAAGCTACTACTACTA

file 4:

>A_sp 8272638
AGAAAAAAAAAAAAAATACTACTA

and so on...

Note that the sequences are similar but not identical so I can't separate by sequence. I can recluster, but with millions of sequences I am trying to avoid this and quickly parse the file.

I'm fairly certain from the gene clustering that each section of each gene begins with an A taxa, but I am not 100% certain.

ADD REPLY • link 7.1 years ago by Josh Herr 5.8k

0

Entering edit mode

Thanks. In short you would like to break every before A_sp and send it to a new file. @OP

ADD REPLY • link 7.1 years ago by cpad0112 21k

score 2 · Answer 1 · 2018-07-09

2

Entering edit mode

7.1 years ago

Pierre Lindenbaum 166k

awk '($1==">A_sp") {n++;close(out);out=sprintf("chunk%d.fa",n);} {print >> out;}' input.fa

ADD COMMENT • link 7.1 years ago by Pierre Lindenbaum 166k

score 0 · Answer 2 · 2018-07-09

Try this: fa files would be numbered 1.fa,2.fa,3.fa,4 so on, It would create many files. Do it a different folder.

awk '/>A_sp / {i++}{print > i".fa"}' test.fa

For the example post above, it would create 4 files. or you can use csplit in following way:

$ csplit  -z -b "%01d.fa"  -f split test.fasta /\>A_sp\ / '{*}'

test.fasta is from OP. This would create 4 files with names: split0.fa, split1.fa and so on so forth.