I'm using a trial of CLC Workbench for assemblies. I would like to enter my assembled fa files into MG-RAST. However, CLC Workbench gives files in the form of:
>sequence_1 Average coverage: 5.6
ACCAGCGTTCTCTACACA
>sequence_2 Average coverage: 6.4
GTTATACAGGATAAGAATC
And so forth (of course, my contigs are much longer). MG-RAST requests a format such as:
>sequence_1_[cov=5.6]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4]
GTTATACAGGATAAGAATC
It is easy enough to get halfway there, and the command below (where BG1.fa is my input file and BG1con.fa is the new output file):
<BG1.fa sed 's/ Average coverage: /_[cov=/g' >BG1con.fa
Gets me to the following fa format:
>sequence_1_[cov=5.6
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
GTTATACAGGATAAGAATC
But I just cannot get that last little bracket at the end. I've tried a couple of things, but it always puts the bracket on a new line such as:
>sequence_1_[cov=5.6
]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
]
GTTATACAGGATAAGAATC
I must apologize, for I am brand new to the sed language, and it is still pretty confusing for me.
Any idea how to elegantly (or not) get the last bracket up?
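A minimal sketch of one way to close the bracket in a single pass is to capture the coverage value on header lines and rebuild the header around it. It assumes every header ends with the coverage number, and it assumes (this is only a guess at the cause) that the stray line break comes from carriage returns left by Windows-style line endings, which `tr` removes first. The sample data written by `printf` is made up to mirror the example above.

```shell
# Create a small sample input with CRLF line endings (hypothetical test data).
printf '>sequence_1 Average coverage: 5.6\r\nACCAGCGTTCTCTACACA\r\n>sequence_2 Average coverage: 6.4\r\nGTTATACAGGATAAGAATC\r\n' > BG1.fa

# 1. tr deletes every carriage return, so each line ends in a plain newline.
# 2. sed matches header lines only: \1 is everything before " Average coverage: ",
#    \2 is the coverage value; the replacement wraps \2 in _[cov=...].
tr -d '\r' < BG1.fa |
  sed 's/^\(>.*\) Average coverage: \(.*\)$/\1_[cov=\2]/' > BG1con.fa

cat BG1con.fa
```

Sequence lines do not match the pattern, so they pass through unchanged. If the input has no carriage returns, the `tr` step is harmless.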
You can very likely do this directly with sed too (I expect someone else will post that method).
It's getting much closer. Both the awk suggestion and the sed suggestion resulted in the "]" being placed in the correct spot, but the first line of sequence is then tacked on the end of the header. So it looks like: