Question

sequence splitting

2.6 years ago

zhichusun ▴ 10

I have a fasta file that contains multiple contigs:

>DEHFGCMO_00205
>MDDGIGEH_00111
>FLCICGHF_00226
>FLCICGHF_00253
>DEHFGCMO_01539
>MDDGIGEH_00625

I want to split the contigs based on the first few letters of their names (the prefix before the underscore) and collect each group in its own fasta file, e.g.

1.fasta

>DEHFGCMO_00205 
>DEHFGCMO_01539

2.fasta

>MDDGIGEH_00111 
>MDDGIGEH_00625

3.fasta

>FLCICGHF_00226 
>FLCICGHF_00253

What should I do? I would be very grateful for your help.

sequence • 963 views

updated 2.6 years ago by cpad0112 21k • written 2.6 years ago by zhichusun ▴ 10

Assuming that each sequence is on a single line and that all sequence names/IDs follow the same pattern (prefix, underscore, number):

$ awk -F '[>_]' '/^>/ {getline seq; print $0 "\n" seq > ($2 ".fa")}' test.fa
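
The one-liner above relies on each sequence occupying exactly one line. For wrapped (multi-line) FASTA, a variant along the following lines may be safer. It is only a sketch and not part of the original answer: the function name split_fasta_by_prefix and the .fa output suffix are illustrative choices, and it assumes the group prefix is everything between ">" and the first underscore.

```shell
# Hypothetical helper (not from the thread): split a FASTA file into one
# output file per ID prefix, written to the current directory.
# Assumption: the prefix is the text between ">" and the first "_".
split_fasta_by_prefix() {
    awk '
        /^>/ {
            # Header line: work out which file this record belongs to.
            id = substr($1, 2)      # drop the leading ">"
            sub(/_.*/, "", id)      # keep only the text before the first "_"
            new = id ".fa"
            if (new != out) {
                # Close the previous file so that inputs with many
                # prefixes do not exhaust the open-file limit.
                if (out != "") close(out)
                out = new
            }
        }
        # Append header and sequence lines to the current output file.
        out != "" { print >> out }
    ' "$1"
}
```

With the example headers from the question, calling split_fasta_by_prefix test.fa should produce DEHFGCMO.fa, MDDGIGEH.fa and FLCICGHF.fa. Because the function appends with >>, run it in a directory that does not already contain files with those names.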

2.6 years ago by cpad0112 21k

score 1 · Answer 1 · 2022-04-07

Use seqkit split, grouping sequences by the part of the ID captured by --id-regexp:

$ seqkit split --by-id --id-regexp "^(.+?)_" test.fasta -O result
[INFO] split by ID. idRegexp: ^(.+?)_
[INFO] read sequences ...
[INFO] read 6 sequences
[INFO] write 2 sequences to file: result/test.id_DEHFGCMO.fasta
[INFO] write 2 sequences to file: result/test.id_MDDGIGEH.fasta
[INFO] write 2 sequences to file: result/test.id_FLCICGHF.fasta