Question

Trim The Fasta Title

1

12.3 years ago

KJ Lim ▴ 140

Good day.

I parsed the titles of the sequences I want from a FASTA file. The titles are rather long, so I would like to trim them a bit with the Unix sed command. A title looks like this:

AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla

I would like to trim the sequence's title as:

AC166615      1-aminocyclopropane-1-carboxylate bla bla bla

I tried with the sed command as below:

sed 's/^.*\:/'$'\t''/g' seqTitle.txt

The output, shown below, has the sequence ID removed as well, but I want to keep the sequence ID.

      1-aminocyclopropane-1-carboxylate bla bla bla

Could someone kindly give me some guidance on using Unix sed for this?

Thanks a lot. Have a nice weekend.

fasta • 4.5k views

ADD COMMENT • link updated 7.0 years ago by shubhra.bhattacharya ▴ 140 • written 12.3 years ago by KJ Lim ▴ 140

score 3 · Answer 1 · 2012-08-24

Here is a small hack. If all the titles look like that, you can cut out the ID first and then merge it with the description, which you get from a slightly modified version of your own sed command (it deletes everything up to the colon instead of replacing it with a tab, because paste adds the tab):

cut -f1 -d" " seqTitle.txt > id && sed 's/^.*: *//' seqTitle.txt > desc

paste -d"\t" id desc > wanted.txt

Afterwards you can remove the temporary files:

rm id desc

Cheers
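As a rough check of the cut/paste approach above, the sketch below runs it on the sample title from the question. The scratch directory and the file names inside it are assumptions made for illustration. Here sed deletes everything through the last colon and any spaces after it, so paste supplies the single tab between the two columns.

```shell
# Scratch directory so the temporary files do not clutter the working directory
workdir=$(mktemp -d)
printf '%s\n' 'AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla' > "$workdir/seqTitle.txt"

# Column 1: the sequence ID (first space-delimited field)
cut -d' ' -f1 "$workdir/seqTitle.txt" > "$workdir/id"

# Column 2: the description, i.e. whatever follows the last colon and the spaces after it
sed 's/^.*: *//' "$workdir/seqTitle.txt" > "$workdir/desc"

# paste joins the two files line by line; a tab is its default delimiter
result=$(paste "$workdir/id" "$workdir/desc")
printf '%s\n' "$result"

# Clean up the scratch files
rm -r "$workdir"
```

Because `^.*:` is greedy, a description that itself contains a colon would be cut at its last colon; that may or may not be what you want for your titles.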

score 2 · Answer 2 · 2012-08-24

2

12.3 years ago

Damian Kao 16k

My regex skills suck, so I often find it a lot faster to just write a quick-and-dirty Python one-liner. It's a bit of a hack, but it works:

python3 -c "print('\n'.join(line.split()[0] + '\t' + line.split(': ')[-1].strip() for line in open('yourFile')))"

ADD COMMENT • link 12.3 years ago by Damian Kao 16k

score 1 · Answer 3 · 2012-08-24

1

12.3 years ago

Random ▴ 160

I find awk to be more intuitive:

awk 'BEGIN{OFS=":"}{split($0,a,":"); print $1,a[2]}'

But I also managed to do it with sed, albeit I doubt this is the best way to do it:

sed 's/\(\)\ .\+\(:\)/\1\2/'

I can't point you to a guide, except maybe for O'Reilly's "Sed and Awk", but I found this list of explained one-liners to be particularly useful to me:

The same site also has tips on awk and perl one-liners, in case you are interested.

ADD COMMENT • link 12.3 years ago by Random ▴ 160

0

Random, thanks for your suggestion. The sed command you mentioned keeps the ":" in the output:

AC166615: 1-aminocyclopropane-1-carboxylate bla bla bla

I think that should be fine. Thanks.

ADD REPLY • link 12.3 years ago by KJ Lim ▴ 140

0

For some reason I had assumed you wanted the ":" to separate the two fields. If not, you can still use awk:

  awk '{split($0,a,":"); print $1,a[2]}'

Or use sed and do:

sed 's/\(\)\ .\+:\(\)/\1\2/'
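For a tab-separated result, as in the question, a minimal awk sketch might look like the following. The variable names `title`, `id`, and `desc` are illustrative assumptions, and the sample line is the one from the question. Unlike splitting on ":", `sub` removes only the text through the first colon, so a description that contains further colons is kept whole.

```shell
title='AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla'

result=$(printf '%s\n' "$title" | awk '{
    id = $1                          # first whitespace-delimited field is the sequence ID
    desc = $0                        # work on a copy of the whole line
    sub(/^[^:]*:[ \t]*/, "", desc)   # drop everything through the first colon and the blanks after it
    print id "\t" desc               # join ID and description with a tab
}')
printf '%s\n' "$result"
```

If a line has no colon at all, `sub` changes nothing and the whole line is printed after the ID, so such lines are easy to spot.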

ADD REPLY • link 12.3 years ago by Random ▴ 160

0

Thanks for the suggestion.

ADD REPLY • link 12.3 years ago by KJ Lim ▴ 140

score 1 · Answer 4 · 2012-08-25

If I understand you correctly, you want to remove the text between the accession and the description that follows the ":". You can do that on a Mac with:

sed -E 's/( |       )[^:]+://'

Note 1: The big white space after the "|" symbol is a tab character, which I inserted by pressing Ctrl-V and then TAB (same on Mac).

Note 2: On Linux with an older GNU sed you need to replace the "-E" option with "-r"; recent GNU sed versions accept "-E" as well.

The regexp itself works as follows:

find the earliest space or tab character ("( | )")
continue for as long as the characters are not a colon ("[^:]+")
match a single colon character (":")

Sed is instructed to:

carry out a substitution via "s"
the substituted text is an empty string "//"
for each line, carry out the match and replace only once (no "/g" option)
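The same idea can be sketched without typing a literal tab, assuming a sed that accepts "-E" and POSIX character classes. This variant also captures the ID and writes it back followed by a tab, which is closer to the output asked for in the question. The `title` and `tab` variables are assumptions made for illustration.

```shell
title='AC166615 weakly similar to UniRef100_A1EGX0 Cluster: 1-aminocyclopropane-1-carboxylate bla bla bla'
tab=$(printf '\t')   # a literal tab character, obtained portably

# ^([^[:space:]]+)     capture the ID: everything before the first blank
# [[:space:]]+[^:]+:   the text to discard, up to and including the first colon
# [[:space:]]*         any blanks directly after the colon
# \1${tab}             put back the captured ID followed by a tab
result=$(printf '%s\n' "$title" | sed -E "s/^([^[:space:]]+)[[:space:]]+[^:]+:[[:space:]]*/\1${tab}/")
printf '%s\n' "$result"
```

As in the explanation above, there is no "g" flag, so only the first match on each line is replaced.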

Hope that helps.

score 0 · Answer 5 · 2018-01-04

Suppose you have some 10,000 such headers. I think this approach should help. Keep all the headers in a file (file_name).

1) cut -d ":" -f2 file_name > temp2

will fetch this part of your string: 1-aminocyclopropane-1-carboxylate bla bla bla

2) awk '{print $1}' file_name > temp1

will fetch this part of your string: AC166615

3) paste -d ":" temp1 temp2 > final_headers.txt

will give you AC166615: 1-aminocyclopropane-1-carboxylate bla bla bla

4) rm temp1 temp2