Fasta header unique sequence
8.9 years ago
Mehmet ▴ 820

Hi,

I have a fasta file in which some records share the same header, as shown below. The sequences are different but the headers are identical. How can I merge them, or what else should I do? I want to run orthoMCL, but it requires unique headers.

>c12358_g1_i9
>c12358_g1_i9
genome sequence • 3.6k views

It seems that your upstream tool emitted different fragments of the same sequence. Merging them with 'N' padding between the fragments may work, but the quicker and better approach is to make the headers unique.
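Before choosing either route, it may help to confirm which header IDs actually repeat. The following is a minimal Python sketch, not part of the original thread: the function name `duplicated_ids` and the example sequences are made up for illustration, and it assumes the ID is the first whitespace-delimited token after `>`.

```python
from collections import Counter


def duplicated_ids(fasta_lines):
    """Return {id: count} for header IDs that occur more than once.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    ids = (
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    counts = Counter(ids)
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Made-up records; only the first header ID comes from the question.
example = [
    ">c12358_g1_i9",
    "ACGTACGTAC",
    ">c12358_g1_i9",
    "TTGACCGTAA",
    ">c10047_g1_i1",
    "GGCATTACGA",
]
print(duplicated_ids(example))  # {'c12358_g1_i9': 2}
```

To run it on a real file, pass an open file handle instead of the list, since the function only iterates over lines.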

8.9 years ago
biocyberman ▴ 870

I don't know about orthoMCL, but if you just want to change the headers to make them unique, run the following (on Linux; on Windows, install GnuWin32 to get the gawk command). It appends an occurrence counter (_1, _2, ...) to the first field of every header line:

gawk '{if ($0 ~/^>/) {h[$1]++; $1=$1 "_" h[$1]} print}' myfasta.fa > updatedIDs_myfasta.fa
# myfasta.fa is your fasta file.
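
For readers without gawk, the same idea can be sketched in Python. This is an illustrative equivalent rather than a tested replacement: the function name `number_headers` and the example sequences are hypothetical. Like the one-liner, it appends a per-ID counter to every header, including IDs that occur only once.

```python
from collections import defaultdict


def number_headers(lines):
    """Yield FASTA lines with '_<n>' appended to each header ID.

    n counts how many times that ID has been seen so far, so repeated
    IDs become ID_1, ID_2, ... and singletons become ID_1.
    """
    seen = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)  # ID, optional description
            if parts:
                seq_id = parts[0]
                seen[seq_id] += 1
                rest = " " + parts[1] if len(parts) > 1 else ""
                line = f">{seq_id}_{seen[seq_id]}{rest}"
        yield line


# Made-up sequences under the duplicated header from the question.
records = [">c12358_g1_i9", "ACGTACGTAC", ">c12358_g1_i9", "TTGACCGTAA"]
for out in number_headers(records):
    print(out)
```

Because every ID receives a suffix, two different original IDs cannot collapse into the same new ID, which is the reason for not skipping the singletons.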

Hi, I used your command, but it didn't change the duplicated headers. Do you have any other solution?


That's weird, my gawk-fu can't be failing :-) Could you post an excerpt of the fasta file with sequences trimmed to about 10 bases?

>c10047_g1_i1|m.4145 c10047_g1_i1|g.4145  ORF c10047_g1_i1|g.4145 c10047_g1_i1|m.4145 type:complete len:387 (-) c10047_g1_i1:511-1671(-)
>c10047_g2_i1|m.4146 c10047_g2_i1|g.4146  ORF c10047_g2_i1|g.4146 c10047_g2_i1|m.4146 type:5prime_partial len:589 (+) c10047_g2_i1:2-1768(+)

These are headers from my fasta file. I want to merge or remove the identical headers before my next step. The records with the same header have different sequences.


Oh, this is different from what you gave in the question. In your fasta file, the tool that generated it already forms unique headers, for example c10047_g1_i1|m.4145 and c10047_g2_i1|m.4146, but orthoMCL probably only considers the part of the header before the pipe sign '|'. You can therefore replace the pipe in the first field:

gawk 'BEGIN{FS=" "}{if ($0 ~/^>/){gsub("\\|", "pp", $1)} print}' myfasta.fa >updatedIDs_myfasta.fa

Change "pp" to anything you like, but keep it distinguishable.
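
The same replacement can be sketched in Python for anyone who prefers it over gawk. This is only an illustration: the function name `replace_pipe_in_id` is hypothetical, and, like the one-liner, it touches only the first whitespace-delimited field of each header line and leaves the description alone.

```python
def replace_pipe_in_id(line, replacement="pp"):
    """Replace '|' in the first field of a FASTA header line.

    Non-header lines are returned unchanged apart from removal of a
    trailing newline.
    """
    line = line.rstrip("\n")
    if not line.startswith(">"):
        return line
    parts = line.split(None, 1)  # first field, optional description
    seq_id = parts[0].replace("|", replacement)
    return seq_id if len(parts) == 1 else f"{seq_id} {parts[1]}"


# Header shortened from the excerpt posted in this thread.
header = ">c10047_g1_i1|m.4145 c10047_g1_i1|g.4145 type:complete len:387"
print(replace_pipe_in_id(header))
# >c10047_g1_i1ppm.4145 c10047_g1_i1|g.4145 type:complete len:387
```

Whatever replacement string is chosen, it should not already occur in the IDs, otherwise the original pipe position can no longer be recovered.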


Thank you so much, you saved my day :)
