Question

Delete character from sequence id

0

5.4 years ago

ericcabarroso • 0

Hi, I am trying to delete the NCBI accession numbers from the sequence ids in a fasta file.

Sequence ids look like:

>Elytraria_mexicana_JQ691768.1

I am trying things like

sed 's/_*.*//' myfile.fasta

or

sed 's/_*.*//g' myfile.fasta

They don't work.

Have any of you done this before?

Thanks for any input,

sed • 1.5k views

ADD COMMENT • link 5.4 years ago by ericcabarroso • 0

1

I would try simply sed 's/_[A-Z].[0-9]*.[0-9]//g' myfile.fasta

ADD REPLY • link 5.4 years ago by Prakash ★ 2.2k

0

You're using . both as a metacharacter and to stand for a literal dot. Are you sure it will work reliably, and that the . meant to match the literal dot won't end up matching something else?

ADD REPLY • link 5.4 years ago by Ram 44k

0

Yes, I agree, Ram. Here . may match any character; to make it more reliable we can use \. instead. Thanks.
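To make the difference concrete, here is a small sketch on a made-up string (not taken from the FASTA file) that compares an unescaped . with \. and with the bracket form [.]:

```shell
# Made-up input for illustration only: one token with a literal dot, one without.
sample='a.c abc'

# Unescaped "." matches any single character, so both tokens are replaced.
unescaped=$(printf '%s\n' "$sample" | sed 's/a.c/X/g')

# "\." matches only a literal dot, so only the first token is replaced.
escaped=$(printf '%s\n' "$sample" | sed 's/a\.c/X/g')

# "[.]" is an equivalent way to write a literal dot.
bracketed=$(printf '%s\n' "$sample" | sed 's/a[.]c/X/g')

printf '%s\n' "$unescaped" "$escaped" "$bracketed"
# prints:
# X X
# X abc
# X abc
```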

ADD REPLY • link 5.4 years ago by Prakash ★ 2.2k

0

Thanks!! It works!!

sed -r 's/_[A-Z0-9]+[.][0-9]+//g' aligned_trnG-trnS.fasta > new_trnG-trnS.fasta
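A slightly stricter variant, offered only as a sketch: it assumes the accession always sits at the very end of the header and looks like letters, digits, a dot, and a version number. The function name strip_accession and the sequence line are made up for illustration. The /^>/ address limits the substitution to header lines, and $ anchors the match to the end of the line.

```shell
# Hypothetical helper: remove a trailing "_ACCESSION.VERSION" from header lines only.
# LC_ALL=C keeps [A-Z] from matching lowercase letters in some locales.
strip_accession() {
  LC_ALL=C sed -E '/^>/ s/_[A-Z]+[0-9]+\.[0-9]+$//'
}

# The sample header from the question plus a made-up sequence line.
cleaned=$(printf '>Elytraria_mexicana_JQ691768.1\nACGTACGT\n' | strip_accession)

printf '%s\n' "$cleaned"
# prints:
# >Elytraria_mexicana
# ACGTACGT
```

Usage on a file would be along the lines of strip_accession < in.fasta > out.fasta, where both file names are placeholders.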

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 5.4 years ago by ericcabarroso • 0

0

just cut:

 cut -d '_' -f 1,2 in.fasta
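A short sketch of what the cut approach does, assuming every header has exactly the form Genus_species_accession. The sequence line and the second header below are made up for illustration; a name containing an extra underscore would lose everything after the second field.

```shell
# The sample header from the question plus a made-up sequence line.
input=$(printf '>Elytraria_mexicana_JQ691768.1\nACGTACGT\n')

# Keep fields 1 and 2 of each line, splitting on "_".
# Lines without any "_" (the sequence lines) are printed unchanged.
trimmed=$(printf '%s\n' "$input" | cut -d '_' -f 1,2)
printf '%s\n' "$trimmed"
# prints:
# >Elytraria_mexicana
# ACGTACGT

# Hypothetical header with an extra "_" in the name: the third field is lost too.
lossy=$(printf '>Genus_species_subsp_AB123456.1\n' | cut -d '_' -f 1,2)
printf '%s\n' "$lossy"
# prints:
# >Genus_species
```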

ADD REPLY • link 5.4 years ago by Pierre Lindenbaum 164k

0

Thank you so much!!
This command works:

sed -r 's/_[A-Z0-9]+[.][0-9]+//g' aligned_trnG-trnS.fasta > new_trnG-trnS.fasta

=D

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 5.4 years ago by ericcabarroso • 0

0

Please stop adding answers. This content belongs as a reply to my comment. I'm moving it to a comment on the top level post now.

ADD REPLY • link 5.4 years ago by Ram 44k

Ram · Answer 1 · 2019-08-16

4

5.4 years ago

Ram 44k

In your pattern, _* means "zero or more underscores" and .* means "any run of characters", so the expression matches from the first character of every line and deletes the whole line rather than just the accession. To remove only the accession, match a literal underscore followed by the accession itself.
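Running the pattern from the question on the sample header shows what it actually matches. This is a minimal check, using only the header quoted in the question:

```shell
header='>Elytraria_mexicana_JQ691768.1'

# "_*" can match zero underscores and ".*" then consumes the rest of the line,
# so the match starts at the first character and covers everything.
result=$(printf '%s\n' "$header" | sed 's/_*.*//')

# Brackets make the empty result visible.
printf '[%s]\n' "$result"
# prints: []
```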

Try sed 's/_[A-Z0-9]+[.][0-9]+//g' myfile.fasta

ADD COMMENT • link 5.4 years ago by Ram 44k

0

Hi Ram, thank you so much for your suggestion. This command

sed 's/_[A-Z0-9]+[.][0-9]+//g' myfile.fasta

doesn't work. I am now trying something like

sed 's/_+[A-Z]+[A-Z]+[0-9]+[0-9]+[0-9]+[0-9]+[0-9]+[0-9]+[.]+[0-9]//g' myfile.fasta

and it also doesn't work. Would you have any sed manual to suggest? Many thanks!

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 5.4 years ago by ericcabarroso • 0

0

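Try sed -r instead of plain sed with the first command. Without -r, sed uses basic regular expressions, in which + is a literal plus sign rather than a "one or more" quantifier. The second command is unnecessarily verbose.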
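The effect of the flag can be checked on the sample header. The sketch below uses -E, which GNU sed treats the same as -r and which BSD sed also accepts; LC_ALL=C is added here only to keep [A-Z] from matching lowercase letters in some locales.

```shell
header='>Elytraria_mexicana_JQ691768.1'
pattern='s/_[A-Z0-9]+[.][0-9]+//g'

# Basic regular expressions (plain sed): "+" is a literal plus sign,
# so nothing matches and the header comes back unchanged.
basic=$(printf '%s\n' "$header" | LC_ALL=C sed "$pattern")

# Extended regular expressions (-E, or -r in GNU sed): "+" means "one or more".
extended=$(printf '%s\n' "$header" | LC_ALL=C sed -E "$pattern")

printf '%s\n' "$basic" "$extended"
# prints:
# >Elytraria_mexicana_JQ691768.1
# >Elytraria_mexicana
```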

ADD REPLY • link 5.4 years ago by Ram 44k