Question

How to concatenate multiple fasta file

3

Entering edit mode

6.7 years ago

fec2 ▴ 50

Hi,

I am using a pipeline called PanX for phylogenomic study. One of my aim is to build a core-genome phylogenomic tree. However, the output fasta files (~1100 gene files) for core-genome gene set are all separated by gene as below:

GeneA

>My_bacteriaA_GeneA 
atgatg

>My_bacteriaB_GeneA
atgaag

>My_bacteriaC_GeneA
atgatg

GeneB

>My_bacteriaB_GeneB
atggtc

>My_bacteriaC_GeneB
atggtc

>My_bacteriaA_GeneB
atggta

And on and on until...

GeneZ

>My_bacteriaA_GeneZ
atggta

>My_bacteriaC_GeneZ
atggta

>My_bacteriaB_GeneZ
atggtg

I wish to have a concatenated fasta file that combined every core-genes for each bacteria as below to build a phylogenomic tree:

>My_bacteriaA
atgatgatggta...atggta

>My_bacteriaB
atgaagatggtc...atggtg

>My_bacteriaC
atgatgatggtc...atggta

May I know how can I create the concatenated fasta file? Please note that the order of the gene sequences for different bacteria of each gene in each fasta file are random, not in a particular order.

Thank you!

sequence genome • 18k views

ADD COMMENT • link updated 22 months ago by doggie • 0 • written 6.7 years ago by fec2 ▴ 50

1

Entering edit mode

Hi, SEDA (https://www.sing-group.org/seda) has an operation named "Concatenate sequences" (https://www.sing-group.org/seda/manual/operations.html#concatenate-sequences) aimed to do what you ask for. However, to use it, you need that the sequence names or identifiers to be identical, so you will need to perform a "Rename header" operation (https://www.sing-group.org/seda/manual/operations.html#rename-header) in order to split your sequence headers in "My_bacteriaA GeneA". Regards.

ADD REPLY • link 6.7 years ago by Hugo ▴ 380

0

Entering edit mode

Hello fec2,

should the (GeneA), (GeneB) ... should be inserted in the concatenated sequence? I guess this will be a problem in further downstream analysis.

Are sequences always in one line or is this a multiline fasta?

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Hi, the (GeneA), (GeneB) .. shouldn't be inserted, I put it in my question just to show I want to combine all the gene sequences. I have removed it to avoid confusion. Thank you.

ADD REPLY • link 6.7 years ago by fec2 ▴ 50

shenwei356 · Answer 1 · 2018-08-15

3

Entering edit mode

6.7 years ago

shenwei356 8.7k

# split species and genes
$ for f in *.fa; do seqkit replace -p '^(.+)_(.+?)$' -r '$1 $2' $f > f.$f; done

$ seqkit seq -n f.GeneA.fa 
My_bacteriaA GeneA
My_bacteriaB GeneA
My_bacteriaC GeneA

# concatenate seqs by IDs
$ seqkit concat f.*.fa
>My_bacteriaA
atgatgatggtaatggta
>My_bacteriaB
atgaagatggtcatggtg
>My_bacteriaC
atgatgatggtcatggta

ADD COMMENT • link 6.7 years ago by shenwei356 8.7k

0

Entering edit mode

@ shenwei356 is there an option to merge sequences in seqkit within a single fasta by ID?

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

Here's an example:

$ cat mix.fa 
>My_bacteriaA GeneA
a
>My_bacteriaA GeneB
c
>My_bacteriaB GeneA
g
>My_bacteriaB GeneB
t

Step 1. spliting by custom IDs

$ seqkit split --id-regexp "^.+ (.+?)$" -i mix.fa  -f -O out            
[INFO] split by ID. idRegexp: ^.+ (.+?)$
[INFO] read sequences ...
[INFO] read 4 sequences
[INFO] write 2 sequences to file: out/mix.id_GeneA.fa
[INFO] write 2 sequences to file: out/mix.id_GeneB.fa

$ cat out/mix.id_GeneA.fa 
>My_bacteriaA GeneA
a
>My_bacteriaB GeneA
g

Step 2. concat

$ seqkit concat out/*.fa
>My_bacteriaA
ac
>My_bacteriaB
gt

ADD REPLY • link 6.7 years ago by shenwei356 8.7k

0

Entering edit mode

Hi @shenwei356 , I am encountering similar problem, where each isolate (T_122, T_123) has two regions of sequence of different lengths, but I want to merge according to its isolates, and wanting to split the sequence according to the region in the format of chr:start-end (ie JA2.1:22400-23038 and JA2.1:1673805-1675711). Could you kindly advise me how shd I amend "--id-regexp".

    $ cat Merged.fa
>T_122_JA2.1:22400-23038
AAA
>T_122_JA2.1:1673805-1675711
C
>T_123_JA2.1:22400-23038
GGG
>T_123_JA2.1:1673805-1675711
T

These are just two for example, but I hope it looks like this for about 100 isolates, so that I can submit to iqtree.

 >T_122 
 AAAC
 >T_123
 GGGT

ADD REPLY • link 22 months ago by doggie • 0

0

Entering edit mode

Another common way:

Split by isolates

 $ seqkit split --id-regexp "^(.+)_[\w\.]+:.+" --by-id-prefix "" -i merged.fa -O tmp --force
 [INFO] split by ID. idRegexp: ^(.+)_[\w\.]+:.+
 [INFO] read sequences ...
 [INFO] read 4 sequences
 [INFO] write 2 sequences to file: tmp/T_122.fa
 [INFO] write 2 sequences to file: tmp/T_123.fa

concatenate all bases in each isolate

 mkdir out

 for f in tmp/*.fa; do
     iso=$(echo $f | sed -E 's/.+\///' | sed 's/.fa//')
     (echo ">$iso"; grep -v ">" $f) | seqkit seq > out/$iso.fasta
 done

    $ ls out/
    T_122.fasta  T_123.fasta

    $ cat out/*.fasta
    >T_122
    AAAC
    >T_123
    GGGT

ADD REPLY • link 22 months ago by shenwei356 8.7k

0

Entering edit mode

@shenwei356 Thank you so much! Your prompt reply helped me a lot!

ADD REPLY • link 22 months ago by doggie • 0

0

Entering edit mode

@shenwei356 My question is related to the one above (@fec2) but the headers are a bit more problematic. Is there a way to concatenate with SeqKit by taking into account only a part of the header (in this case, the acc. number)? The gene sequences are in different files. Thank you!

>ENA|MCFL01000433|MCFL01000433.1|F|SSU Extracted SSU sequence (7064 bp)
AACTTGTGCAGCAGCAAAAAAGGTGCAAAACTACAGCACCCAG...

>ENA|MCFL01000433|MCFL01000433.1|F|LSU Extracted LSU sequence (5892 bp)
CATATTAATAAGCGGAGGAAAAGAAACTAACAAGGATACCC...

ADD REPLY • link updated 5.5 years ago by shenwei356 8.7k • written 5.5 years ago by jan • 0

2

Entering edit mode

--id-regexp '^[^\|]+\|[^\|]+\|([^\|]+)\|'

ADD REPLY • link 5.5 years ago by shenwei356 8.7k

0

Entering edit mode

Great, thank you very much!

ADD REPLY • link 5.5 years ago by jan • 0

0

Entering edit mode

Is it also possible with SeqKit to concatenate based on the acc. number but keep (print) all the full names of the concatenated genes? E.g., AYX1000_SSU_gene & AYX1000_LSU_gene to: AYX1000_SSU_geneAYX1000_LSU_gene

ADD REPLY • link 5.4 years ago by jan • 0

score 2 · Answer 2 · 2018-08-17

2

Entering edit mode

6.7 years ago

finswimmer 16k

What about this (based on your header description here)?

$ cut -f1 -d "-" *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '{for(i=2; i<=NF; i++) {seq[$1] = seq[$1]$i}}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa

ADD COMMENT • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Thank you very much! This is working.

ADD REPLY • link 6.7 years ago by fec2 ▴ 50

0

Entering edit mode

Hi, I am having the same question as this one. But my headers look different, I tried your solution but it removes all of my headers. Would you be so kind to help me with it?

>Lineage-1-Name-L1
ATGCGGAAAGGGGTTGATCTCGTGACgGC....
>Lineage-2-y2007-018
ATGCGGAAAGGGGTTGATCTCGTGACgG...

Thanks a lot.

ADD REPLY • link 4.7 years ago by vujex ▴ 10

finswimmer · Answer 3 · 2018-08-15

0

Entering edit mode

6.7 years ago

finswimmer 16k

Hello,

here an awk solution, assuming that each sequence has only a single line:

cat *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 {match($1, /^(.+)_[^_]+$/, name); seq[name[1]] = seq[name[1]]$2}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa

fin swimmer

ADD COMMENT • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Hi,

I got an error as below:

awk: line 1: syntax error at or near ,

May I know is it because my fasta file is not a single line fasta file?

Thank you.

ADD REPLY • link updated 6.7 years ago by finswimmer 16k • written 6.7 years ago by fec2 ▴ 50

0

Entering edit mode

Hello again,

no the syntax error is not due to multiline fasta. Maybe you missed something to copy from my command?

As I said, the command above doesn't work for multiline fasta. For this try this:

$ cat *.fa|awk -v RS=">" -v FS="\n" -v OFS="\n" '$0 {match($1, /^(.+)_[^_]+$/, name); for(i=2; i<=NF; i++) {seq[name[1]] = seq[name[1]]$i}}; END {for(id in seq){print ">"id, seq[id]}}' > combined.fa

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Hi, I got the same error, below is an example of my sequence ID:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

May I know how to see whether the fasta file is multiline or single line?

Thank you.

ADD REPLY • link updated 6.7 years ago by finswimmer 16k • written 6.7 years ago by fec2 ▴ 50

0

Entering edit mode

fec2 for making zero length fasta files: try seqkit seq -w 0 input.fa

ADD REPLY • link 6.7 years ago by cpad0112 21k

finswimmer · Answer 4 · 2018-08-15

0

Entering edit mode

6.7 years ago

cpad0112 21k

Assuming that sequences in files are in exact OP format: output:

$ awk -v RS=">" -v OFS="\t" '{print $1,$2}' *.fa | sed '/^\s*$/d;s/_\w\{5\}//2g' | datamash -sg 1 collapse 2 | sed 's/,//g' | awk -v OFS="\n" '{print ">"$1,$2}'

>My_bacteriaA
atgatgatggtaatggta
>My_bacteriaB
atgaagatggtcatggtg
>My_bacteriaC
atgatgatggtcatggta

input (from OP);

$ tail -n +1  *.fa
==> genea.fa <==
> My_bacteriaA_GeneA
atgatg
> My_bacteriaB_GeneA
atgaag
> My_bacteriaC_GeneA
atgatg

==> geneb.fa <==
> My_bacteriaB_GeneB
atggtc
> My_bacteriaC_GeneB
atggtc
> My_bacteriaA_GeneB
atggta

==> genez.fa <==
> My_bacteriaA_GeneZ
atggta
> My_bacteriaC_GeneZ
atggta
> My_bacteriaB_GeneZ
atggtg

ADD COMMENT • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

Thank you, but is not really working, there is no output file in my folder.

Here is an example of my input sequence ID:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

ADD REPLY • link 6.7 years ago by fec2 ▴ 50

0

Entering edit mode

Hello fec2,

Please use the formatting bar (especially the code option) to present your post better.

code_formatting

I cannot clearly see how your sequence ID looks like.

Thank you!

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

I am sorry I am new in this field and this forum. I have edited the comment above. Thank you.

ADD REPLY • link 6.7 years ago by fec2 ▴ 50

0

Entering edit mode

hi fec2 could you please post few more headers? pattern can be improved if we have more fasta headers.

ADD REPLY • link 6.7 years ago by cpad0112 21k

0

Entering edit mode

Hi, here is example of 1 gene file:

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_kocurii_ATCC_43650-AUO94_RS06550-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_antarcticus_DSM14505-BBH88_RS02420-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_versutus_L10_15-I858_RS02180-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_halocryophilus_DSM24743-BBI08_RS02205-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_PAMC_21323-PLANO_RS01740-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_donghaensis_MPA1U2-GPDM_RS01230-63-cydC_thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_donghaensis_DSM22276-BCM40_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_SCU63-DP120_RS12840-13-cydC_thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_sp_CAU13-KW93_RS09120-323-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_Y42-B0X71_RS16755-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_SAMP-BHE17_RS12170-31-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>
>Planococcus_maritimus_MKU009-AY633_RS12345-32-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

ADD REPLY • link updated 6.7 years ago by finswimmer 16k • written 6.7 years ago by fec2 ▴ 50

0

Entering edit mode

Could you please explain how the bacterias are delimited from the genes?

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

The pipeline using NCBI genebank file as input.

The ID actually contain

>Bacteria species and strain name-locus tag of the gene-gene name

For example for

>Planococcus_faecalis_AJ003-AJGP001_RS02220-1-thiol_reductant_ABC_exporter_subunit_CydC <unknown description>

>Planococcus_faecalis_AJ003(Bacteria species and strain name)-AJGP001_RS02220(locus tag of the gene)-1-thiol_reductant_ABC_exporter_subunit_CydC(gene name) <unknown description>

Thank you.

ADD REPLY • link 6.7 years ago by fec2 ▴ 50

0

Entering edit mode

Please show REAL example ASAP.

$ for f in *.fa; do seqkit replace -p '^(.+?)-(.+)$' -r '$1 $2' $f > f.$f; done

$ seqkit concat f.*.fa

ADD REPLY • link 6.7 years ago by shenwei356 8.7k

0

Entering edit mode

Sorry for not showing the real example in the beginning. The command is working but the ID of each bacteria all become 'S1'. I can't differentiate them. May I know have I done anything wrong?

I also noticed that for each f.*.fa file, the multiple fasta are in different order, I guess it just removed the bacteria species and strain name. As I have raised in the beginning, all the multiple fasta sequence are in random order in each file. Will it be affected when I concatenate them?

ADD REPLY • link 6.7 years ago by fec2 ▴ 50