How to merge a multi-FASTA file into a single sequence with only one header?
4
0
6.8 years ago
majeedaasim ▴ 60

I have a multi-FASTA sequence file. I want to merge all the sequences together to create a single-sequence file. I mean that the ">ID" header lines should be removed to create one super-sequence. This would take a long time to do manually.

How can this be done in Linux?

Thanks

fasta merge • 12k views

If I may ask, what do you need this for?


@majeedaasim please accept the answer if it works for you; it helps keep us motivated. Good luck!

5
6.8 years ago
Charles Plessy ★ 2.9k

Using the union command from the EMBOSS package:

$ cat test.fasta 
>seq1
AAAATTGGG
>seq2
GGCCCTTTT
>seq3
AAATGGGG

$ union -filter test.fasta
>seq1
AAAATTGGGGGCCCTTTTAAATGGGG
3
6.8 years ago
mittu1602 ▴ 200

grep -v "^>" test.fasta | awk 'BEGIN { ORS=""; print ">My_New_Sequence_name\n" } { print } END { print "\n" }' > new.fasta

test.fasta
>seq1
AAAATTGGG
>seq2
GGCCCTTTT
>seq3
AAATGGGG

new.fasta
>My_New_Sequence_name
AAAATTGGGGGCCCTTTTAAATGGGG

Hi,

Would this work on Mac OS X?


Not necessarily. macOS ships with BSD versions of tools such as grep and sed rather than the GNU ones. Consequently, the syntax often isn't 100% transferable. It may work, but that's not something you can rely on. You can, however, install the GNU versions via Homebrew or MacPorts.
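
For what it's worth, a version that sticks to plain POSIX awk features should behave the same with the tools on macOS and on Linux. This is only a sketch under that assumption: the header name `merged`, the temporary directory and the file names are placeholders, not anything taken from the answer above.

```shell
# Build a small example input (the same toy data as test.fasta above).
workdir=$(mktemp -d)
printf '>seq1\nAAAATTGGG\n>seq2\nGGCCCTTTT\n>seq3\nAAATGGGG\n' > "$workdir/test.fasta"

# Print one new header, then every non-header line without a line break,
# and finish with a single newline. "merged" is a placeholder name.
awk 'BEGIN { print ">merged" }
     !/^>/ { printf "%s", $0 }
     END   { printf "\n" }' "$workdir/test.fasta" > "$workdir/new.fasta"

cat "$workdir/new.fasta"
```

Doing the header filtering inside awk also avoids the separate grep step, so there is one fewer tool whose behaviour could differ between systems.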

2
6.8 years ago
Joe 21k
cat multifasta.fa | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g' > output.fa

E.g.:

$ cat ~/test/seqs.fasta
>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGCNNNNNN
>tpg|Pyricularia_pennisetigena|AB818016
NNNNNNGCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
NNNAACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGCNNN


$ cat ~/test/seqs.fasta | sed -e '1!{/^>.*/d;}' | sed  ':a;N;$!ba;s/\n//2g'
>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGCNNNNNNNNNNNNGCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAANNNAACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGCNNN

(retains just the header of the first seq in the multifasta)

Bonus:

If you also want to hard line-wrap the FASTA to 80 characters (or any other width), the command becomes:

cat "$1" | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g'
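
If the sed syntax above is not accepted by your sed, the same idea can probably be expressed with `grep`, `tr` and `fold`, which are all standard utilities. The following is a rough sketch rather than a drop-in replacement: it assumes the first line of the input is a header, and the toy data and file names are made up for illustration.

```shell
workdir=$(mktemp -d)
# Toy input: three records of 100, 100 and 50 bases (250 in total).
awk 'BEGIN {
    n[1] = 100; n[2] = 100; n[3] = 50
    for (r = 1; r <= 3; r++) {
        s = ""
        for (i = 0; i < n[r]; i++) s = s "A"
        print ">seq" r
        print s
    }
}' > "$workdir/seqs.fasta"

{
    # Keep the first header as-is (assumes line 1 is a header).
    sed -n '1p' "$workdir/seqs.fasta"
    # Drop all headers, join the sequence lines, re-add one final newline,
    # then hard-wrap at 80 columns.
    { grep -v '^>' "$workdir/seqs.fasta" | tr -d '\n'; echo; } | fold -w 80
} > "$workdir/wrapped.fasta"

cat "$workdir/wrapped.fasta"
```

Changing the wrap width is then just a matter of passing a different number to `fold -w`.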

Can the final output file be kept as .gz? Thanks


You can pipe the output of the command to gzip - just tell it to use STDIN as the data source.
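
A small sketch of what that might look like: gzip reads standard input when no file name is given, and `-c` makes writing to standard output explicit. The merge step here is a plain awk stand-in (it assumes the first line is a header) rather than the sed pipeline from the answer, and the file names are placeholders.

```shell
workdir=$(mktemp -d)
printf '>seq1\nAAAATTGGG\n>seq2\nGGCCCTTTT\n' > "$workdir/multifasta.fa"

# Merge while keeping the first header (assumes line 1 is a header),
# then compress the stream directly instead of writing a plain file first.
awk 'NR == 1 { print; next }
     !/^>/   { printf "%s", $0 }
     END     { printf "\n" }' "$workdir/multifasta.fa" | gzip -c > "$workdir/output.fa.gz"

# Inspect the result without unpacking it to disk.
gzip -dc "$workdir/output.fa.gz"
```

If the input is itself gzipped, starting the pipeline with `gzip -dc input.fa.gz |` should work the same way.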

0
6.8 years ago
yhoogstrate ▴ 150

grep -v '^>' in.fa > out.fa

If in.fa is:

>chr1
ttttccccaaaagggg
>chr2
ACTGACTGnnnnACTG
>chr3.1
ACTGACTGaaaac
>chr3.2
ACTGACTGaaaacc
>chr3.3
ACTGACTGaaaaccc
>chr4
ACTGnnnn
>chr5
nnACTG

then out.fa becomes:

ttttccccaaaagggg
ACTGACTGnnnnACTG
ACTGACTGaaaac
ACTGACTGaaaacc
ACTGACTGaaaaccc
ACTGnnnn
nnACTG
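
As the example shows, this drops every header but leaves each sequence on its own line. If a single unbroken line is wanted instead (an assumption about the goal, not part of the command above), one option is to strip the newlines with `tr`. A minimal sketch with a shortened version of the toy input:

```shell
workdir=$(mktemp -d)
printf '>chr1\nttttccccaaaagggg\n>chr2\nACTGACTGnnnnACTG\n>chr4\nACTGnnnn\n' > "$workdir/in.fa"

# Remove header lines, delete every newline, then add one back at the end.
{ grep -v '^>' "$workdir/in.fa" | tr -d '\n'; echo; } > "$workdir/out.fa"

cat "$workdir/out.fa"
```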