Would you please give me some advices about how to change header?
3
0
Entering edit mode
3.2 years ago
Riku ▴ 80

Hi, all.

I would like to edit headers from fasta.

I have fasta with random header as following(headers are separated by space);

>3R5.1a wormpep=CE24758 gene=WBGene00007065 locus=pot-3 status=Confirmed uniprot=G5EFG7 insdc=CAA21777.2 product="POT1PC domain-containing protein"
>2RSSE.1a wormpep=CE32785 gene=WBGene00007064 locus=rga-9 status=Confirmed uniprot=A4F337 insdc=CCD61138.1 product="Rho-GAP domain-containing protein"
>2L52.1a wormpep=CE32090 gene=WBGene00007063 status=Confirmed uniprot=A4F336 insdc=CCD61130.1

I would like to edit it like a following;

>3R5.1a gene=WBGene00007065 locus=pot-3    
>2RSSE.1a gene=WBGene00007064 locus=rga-9
>2L52.1a gene=WBGene00007064 -

In fact, this file is so long that command operations are required as you know. But I don't know how to edit random header. Could you please give me a help?

I really appreciate your help in advance! Thank you.

bash Linux BLAST fasta • 1.8k views
ADD COMMENT
1
Entering edit mode

One lazy way:

$ cut -d " " -f1,3,4 your.fa > new.fa

If locus field is not present in every header then whatever is next will be picked up e.g. status in example above in line 3.

ADD REPLY
1
Entering edit mode
3.2 years ago

awk -F 'status' '{print $1}' file.fa > new_file.fa

ADD COMMENT
2
Entering edit mode

This will include wormpep field which is not wanted in example by OP. That can be eliminated by sed.

awk -F 'status' '{print $1}' file.fa | sed 's/wormpep=CE.....//' > new.fa
ADD REPLY
0
Entering edit mode

Oh, good call, didn't even notice that.

ADD REPLY
0
Entering edit mode

I see. It helped a lot. I was able to solve this problem with the help of GenoMax and Jared advices.

$ awk -F 'status' '{print $1}' file.fa | cut -d " " -f 1,3,4 > new.fa

But the method by sed is simpler. I will use it.

I'm really appreciated both of you for your quick answer. Thank you very much!

ADD REPLY
1
Entering edit mode
3.2 years ago

This is one of those problems that are hard to solve in a reliable manner since the delimiters are inconsistent. Spaces separate both the fields and internal content that in turn are protected with quotes. You would need to write a parser regardless, then depending on the file structure the implementation may be quite difficult.

As long as the desired key=value pairs are listed before other fields with quoted whitespaces, the python code could look like this:

import sys

targets = { 'gene', 'locus'}

for line in open('mydata.fa'):

    line = line.strip()

    if line.startswith(">"):

        pieces = line.split(" ")

        parts = [pieces[0]] + [p for p in pieces if p.split("=")[0] in targets]

        line = " ".join(parts)

        if "locus" not in line:
            line += " -"

    print(line)

run it as:

python fixme.py 
ADD COMMENT
0
Entering edit mode

I don't understand python, but it was good to know there's way to do this. I also try to this.

I appreciated your quick answer!

ADD REPLY
1
Entering edit mode
3.2 years ago

with seqkit:

$ seqkit replace  -p '(.*) worm.* (gene=\w+) (locus=\w+\W\w+)*.*' -r '${1} ${2} ${3}' test.fa     

>3R5.1a gene=WBGene00007065 locus=pot-3
atgc
>2RSSE.1a gene=WBGene00007064 locus=rga-9
atc
>2L52.1a gene=WBGene00007063 
gtc

with sed:

$ sed -r '/^>/ s/(>[A-Za-z0-9.]+).*worm.*\s(gene=\w+)\s(locus=\w+\W[0-9]|-)*.*/\1 \2 \3-/g' test.fa | sed -r 's/([0-9])-$/\1/g'

>3R5.1a gene=WBGene00007065 locus=pot-3
atgc
>2RSSE.1a gene=WBGene00007064 locus=rga-9
atc
>2L52.1a gene=WBGene00007063 -
gtc

with awk (Assuming that locus is alway present after wormrep and gene):

$ awk '/^>/ {print $1,$3, ($4 ~ /locus/) ? $4: "-"};!/^>/'  test.fa
$ awk '/^>/ {print $1,$3, ($4 ~ /locus/) ? $4: "-";next}1'  file.fa

>3R5.1a gene=WBGene00007065 locus=pot-3
atgc
>2RSSE.1a gene=WBGene00007064 locus=rga-9
atc
>2L52.1a gene=WBGene00007063 -
gtc
ADD COMMENT
0
Entering edit mode

I moved it to an answer, probably the simplest approach that also generalizes well.

ADD REPLY

Login before adding your answer.

Traffic: 1960 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6