Question: How do I change the FASTA header format from CLC Workbench for MG-RAST?
9.8 years ago

I'm using a trial of CLC Workbench for assemblies. I would like to enter my assembled fa files into MG-RAST. However, CLC Workbench gives files in the form of:

>sequence_1 Average coverage: 5.6
ACCAGCGTTCTCTACACA
>sequence_2 Average coverage: 6.4
GTTATACAGGATAAGAATC

And so forth (of course, my contigs are much longer). MG-RAST requests a format such as:

>sequence_1_[cov=5.6]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4]
GTTATACAGGATAAGAATC

It is easy enough to get halfway there. The command below (where BG1.fa is my input file and BG1con.fa is the new output file):

<BG1.fa sed 's/ Average coverage: /_[cov=/g' >BG1con.fa

Gets me to the following fa format:

>sequence_1_[cov=5.6
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
GTTATACAGGATAAGAATC

But I just cannot get that last little bracket at the end. I've tried a couple of things, but it always puts the bracket on a new line such as:

>sequence_1_[cov=5.6
]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
]
GTTATACAGGATAAGAATC

I must apologize, for I am brand new to the sed language, and it still is pretty confusing for me.

Any idea how to elegantly (or not) get that last bracket onto the header line?

9.8 years ago
...input commands... | awk '{if($0 ~ /^>/) {$1=$1"]"} print $0}' > output.fa
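
For comparison, the whole conversion can probably be done in a single awk pass, without the preceding sed step. This is only a sketch: the function name `clc_headers_awk` is made up for illustration, and it assumes every header has the exact " Average coverage: " label with nothing after the coverage value.

```shell
# Hypothetical helper (the name is an assumption, not from this thread).
# Reads CLC-style FASTA on stdin, writes MG-RAST-style headers to stdout.
clc_headers_awk() {
  awk '
    /^>/ {
      # Replace the CLC coverage label with the opening of the MG-RAST tag;
      # sub() returns 1 only when the label was found on this header.
      if (sub(/ Average coverage: /, "_[cov=")) {
        # Close the bracket at the end of the header line.
        $0 = $0 "]"
      }
    }
    { print }   # sequence lines pass through unchanged
  '
}
```

Usage would look like `clc_headers_awk < BG1.fa > BG1con.fa`. Headers without the coverage label are left untouched rather than getting a stray bracket.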

You can very likely do this directly with sed too (I expect someone else will post that method).

9.8 years ago
Neilfws 49k

Awk works as Devon illustrated; the sed solution is:

sed -E 's/ Average coverage: (.+)/_[cov=\1]/' BG1.fa > BG1con.fa

The -E switch enables extended regular expressions. The \1 refers to whatever the parenthesized group (.+) captured, which is everything following "Average coverage: " up to the end of the line, so this assumes that no header line contains anything after the coverage value.
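
As a quick check of that behaviour, here is a small worked example using the two records from the question. The wrapper name `add_cov_tag` is hypothetical and only there so the command can be fed from a pipe; the sed expression itself is the one given above.

```shell
# Hypothetical wrapper around the sed command above; reads stdin, writes stdout.
add_cov_tag() {
  sed -E 's/ Average coverage: (.+)/_[cov=\1]/'
}

# The two records from the question, one argument per line.
printf '%s\n' \
  '>sequence_1 Average coverage: 5.6' 'ACCAGCGTTCTCTACACA' \
  '>sequence_2 Average coverage: 6.4' 'GTTATACAGGATAAGAATC' | add_cov_tag
# prints:
# >sequence_1_[cov=5.6]
# ACCAGCGTTCTCTACACA
# >sequence_2_[cov=6.4]
# GTTATACAGGATAAGAATC
```

Sequence lines do not contain the " Average coverage: " text, so the substitution leaves them alone.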

Solution was found here.


It's getting much closer. Both the awk language suggestion and the sed language suggestion resulted in the "]" being placed in the correct spot, but the first line of sequence is then tagged on the end of the header. So it looks like:

>sequence_1_[cov=5.6]ACCAGCGTTCTCTACACA
ATTACACGGCACCCAC
>sequence_2_[cov=6.4 ]GTTATACAGGATAAGAATC
GGCCCACTATTATATCA

Not on my (Ubuntu Linux) machine. Could be a line endings issue with your OS.
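
If line endings are the culprit, one likely case is a file saved with Windows-style CRLF endings: each header then ends in an invisible carriage return, and anything appended "at the end of the line" lands after it. The sketch below assumes CRLF endings (not confirmed in this thread) and uses a made-up function name, `fix_crlf_headers`. It strips the carriage returns first and then runs the same sed substitution.

```shell
# Hypothetical helper (the name is an assumption, not from this thread).
# Assumes the input uses CRLF line endings; reads stdin, writes stdout.
fix_crlf_headers() {
  # Drop every carriage return so each line ends in a plain newline,
  # then rewrite the coverage label into the MG-RAST tag.
  tr -d '\r' | sed -E 's/ Average coverage: (.+)/_[cov=\1]/'
}
```

Usage would be `fix_crlf_headers < BG1.fa > BG1con.fa`. To see whether the file really has carriage returns, `od -c BG1.fa | head` should show `\r \n` pairs at the line ends.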

9.8 years ago

Got it!

sed 's/ Average coverage: /_[cov=/' BG.fa | sed 's/cov=[0-9.]*/&]/' >BGcon.fa

That took way longer for me to figure out than I'll ever admit to my supervisor.

Well, my day can only go downhill from here, I should just call it a day.
