Remove text and keep '> + ID' in fasta file
6.8 years ago
Harumi ▴ 20

Hello,

I have multiple fasta sequences that are like this:

 >2p__scaffold_2__5799__6580__-__778568__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
 >2p__scaffold_2__5799__6580__+__778569__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
 >1p__scaffold_2__11235__11438__-__830827__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
 ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
 >1p__scaffold_2__33129__34129__+__811706__0.00__0.00
 GCTGGCGACGGATCTA

And I want to keep just the "> + ID" (the number after __+__ or __-__ and before __0.00__0.00).

So I expect an output like this:

>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC

I searched for it and tried this:

sed 's@.*__-__@@' input.fa > output.fa

That removed __-__ and everything before it, including the ">" that I wanted to keep.

I also tried this, to remove everything between ">" and __-__:

sed -e 's/\>//' -e 's/\__-__.*//' input.fa > output.fa

But this removed everything after __-__, including the ID.

And this, which only removed the trailing __0.00__0.00:

sed 's/__0.00.*$//' input.fa > output.fa

Thank you for your help.

fasta sed • 2.6k views

Now THIS is how you write a "please help me with fasta headers" question!

6.8 years ago

Try the following awk command (it needs GNU awk, because the three-argument form of match() is a gawk extension):

$ awk '{ if ($0 ~ /^>/) { match($1, /[+-]__([0-9]+)__/, m); print ">" m[1]; } else { print $0; } }' input.fa > output.fa

Then:

$ less output.fa
>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA
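
For awk implementations other than GNU awk (mawk, BSD awk), one portable option is to split each line on the double underscore rather than calling match() with three arguments. This is only a minimal sketch: it assumes the ID is always the sixth `__`-separated field, as in the sample headers, and the file names sample.fa and renamed.fa are placeholders created for the demo.

```shell
# Build a small test file with the header layout from the question.
workdir=$(mktemp -d)
cat > "$workdir/sample.fa" <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>1p__scaffold_2__33129__34129__+__811706__0.00__0.00
GCTGGCGACGGATCTA
EOF

# Split every line on "__". On header lines print ">" plus field 6 (the ID);
# sequence lines contain no "__" and are passed through unchanged.
awk -F'__' '/^>/ { print ">" $6; next } { print }' \
    "$workdir/sample.fa" > "$workdir/renamed.fa"

cat "$workdir/renamed.fa"
```

If the number of `__`-separated fields before the ID can vary, the field index would have to change, so the regex-based command is the safer choice in that case.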

It worked! Thank you for your help!


You're quite welcome!

6.8 years ago
cschu181 ★ 2.8k

This should work if your headers follow the pattern that you specified:

sed 's/[>_]\+/_/g' yourfile.fasta | cut -f 8 -d _ | sed 's/^\([0-9]\)/>\1/'
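
To see why field 8 is the ID, it can help to run the three stages separately on a single header. The sketch below uses one header from the question as its input. The variable names are made up for illustration, and the first stage is written as `[>_][>_]*`, which should be equivalent to `[>_]\+` but does not rely on a GNU sed extension.

```shell
header='>2p__scaffold_2__5799__6580__-__778568__0.00__0.00'

# Stage 1: collapse every run of ">" or "_" into a single "_".
stage1=$(printf '%s\n' "$header" | sed 's/[>_][>_]*/_/g')
printf 'stage 1: %s\n' "$stage1"

# Stage 2: the line now starts with "_", so field 1 is empty
# and the ID ends up in field 8.
stage2=$(printf '%s\n' "$stage1" | cut -d _ -f 8)
printf 'stage 2: %s\n' "$stage2"

# Stage 3: put ">" back in front of lines that start with a digit.
stage3=$(printf '%s\n' "$stage2" | sed 's/^\([0-9]\)/>\1/')
printf 'stage 3: %s\n' "$stage3"
```

The count of 8 depends on the scaffold name containing exactly one underscore ("scaffold_2"); a name with more or fewer underscores would shift the field number.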

It worked! Thank you for your help!

6.8 years ago
test=">2p__scaffold_2__5799__6580__-__778568__0.00__0.00"
echo $test | sed 's@.*__[\+-]__@>@' | sed 's@__.*@@'

Some might find:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//'

to be a bit more readable.
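
The two substitutions can also be folded into one by capturing the ID. This is a sketch under two assumptions: the ID is purely numeric, and it always sits between `__+__` or `__-__` and the next `__`. The files input.fa and output.fa are stand-ins created in a temporary directory for the demo.

```shell
workdir=$(mktemp -d)
cat > "$workdir/input.fa" <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>2p__scaffold_2__5799__6580__+__778569__0.00__0.00
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
EOF

# Only touch header lines (/^>/). Keep ">" and the digits captured after the
# strand sign; drop everything else on the line.
# "[+-]" needs no backslash: inside brackets "+" is literal, and "-" is
# literal when it comes last.
sed '/^>/ s/^>.*__[+-]__\([0-9][0-9]*\)__.*$/>\1/' \
    "$workdir/input.fa" > "$workdir/output.fa"

cat "$workdir/output.fa"
```

Restricting the substitution to header lines means sequence lines are never rewritten, and a header that does not match the pattern is left unchanged rather than truncated.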


You will miss the __+__ cases with this one ;)


Ah, I didn't see that at first. I have edited my answer to cover that case.


Thank you for your help!

I tried:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//' input.fa > output.fa

But it ran for a long time without finishing, so I canceled it.

When I tried:

sed 's/.*__[\+-]__/>/' input.fa > output.fa | sed 's/__.*//' output.fa > output2.fa

The output was empty.

Why does this happen?

Thank you again!!


You need to put the input file before the '|' symbol, like this:

sed 's/.*__[\+-]__/>/' input.fa | sed 's/__.*//' > output.fa

Otherwise the first sed has no file to read and just waits for input from the terminal, which is why it seemed to take so long.

The second command is wrong and will never work: ' > output.fa' sends the first sed's output to the file, so nothing goes through the pipe. The second sed then most likely reads output.fa before anything has been written to it, which gives the empty file you mention.
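
The effect of a redirect in the middle of a pipeline can be checked with a tiny experiment. This is only a sketch with throwaway data: demo.txt and first.txt are made-up file names, and `wc -c` stands in for the second sed so that the amount of data arriving through the pipe becomes visible.

```shell
workdir=$(mktemp -d)
printf 'header__tail\n' > "$workdir/demo.txt"

# Correct form: the file is an argument of the FIRST command, and its
# output flows through the pipe into the second command.
piped=$(sed 's/header/>id/' "$workdir/demo.txt" | sed 's/__.*//')
printf 'piped result: %s\n' "$piped"

# With "> first.txt" on the first command, its output goes to the file,
# so the pipe carries nothing and wc -c counts 0 bytes on standard input.
bytes=$(sed 's/header/>id/' "$workdir/demo.txt" > "$workdir/first.txt" | wc -c | tr -d ' ')
printf 'bytes through the pipe: %s\n' "$bytes"
```

When an intermediate file really is wanted, running the two commands one after the other (separated by `&&` or a newline instead of `|`) avoids the second command reading the file before it has been written.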


It worked! Thank you very much for your helpful explanation! I am still a beginner in bioinformatics.

6.3 years ago

A little late to the party:

$ sed '/>/ s/^.*__\(\w\+\)__.*/>\1/g' file.fa

or

$ sed '/>/ s/^\(\W\).*__\(\w\+\)__.*/\1\2/g' file.fa

Both give:

>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA

"a little late to the party"

Always very welcome, though.
