Question

how to edit fasta header

0

Entering edit mode

21 months ago

fastamasterfromnow • 0

I would like to change the fasta header

For example, I have following sequences

>NR_130660.1 Hanseniaspora uvarum CBS 314 ITS region; from TYPE material
 AAGGATCATTAGATTGAATTATCATTGTTGCTCGAGTTCTTGTTTAGATCTTTTACAATAATGTGTATCT

>NR_131850.1 Cortinarius timiskamingensis NBM D. Malloch 3-9-81/2 ITS region; from TYPE material
 GGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTATTGAAATAAACCTGAT

>NR_171752.1 Melanconis marginalis subsp. europaea CBS 131692 ITS region; from TYPE material
AACGACCACCCAGGGCCGGAAACTTCTCCAAACTCGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTC

and my desired output is following:

>NR130660.1_Hanseniaspora_uvarum
 AAGGATCATTAGATTGAATTATCATTGTTGCTCGAGTTCTTGTTTAGATCTTTTACAATAATGTGTATCT

>NR131850.1_Cortinarius_timiskamingensis
 GGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTATTGAAATAAACCTGAT

>NR171752.1_Melanconis_marginalis_subsp._europaea
AACGACCACCCAGGGCCGGAAACTTCTCCAAACTCGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTC

Thank you in advance!

fasta • 1.7k views

ADD COMMENT • link updated 21 months ago by zx8754 12k • written 21 months ago by fastamasterfromnow • 0

2

Entering edit mode

The problem here is that from e.g.

>NR_130660.1 Hanseniaspora uvarum CBS 314 ITS region; from TYPE material

You want fields 1-3 (space delimited), whereas from e.g.

>NR_171752.1 Melanconis marginalis subsp. europaea CBS 131692 ITS region; from TYPE material

You want fields 1-5

What kind of rule makes both of these possible, e.g. in your file is it always: Accession<space>species name (n space delimited fields)<space>capital letter something? There's never a capital letter in the species field (other than the first letter)? A capital letter always follows the species name?

ADD REPLY • link 21 months ago by 5heikki 11k

3

Entering edit mode

The problem here is that from e.g.

I think the real problem here is that the OP has made no effort and wants us to do all the work.

ADD REPLY • link 21 months ago by Mensur Dlakic ★ 28k

0

Entering edit mode

Sorry that if you felt and seemed that way. I am new to bioinformatics and was not sure about the data type. I have specified my question.

ADD REPLY • link 21 months ago by fastamasterfromnow • 0

0

Entering edit mode

Thank you for your reply. I am new to the field, and sorry that my question was vague. I wasn't sure about the data type. I tried sed 's/ /_/g' and _ part is fine now. Now I just want try for the field 1-3 (space delimited) as you said.

For example,

from

>NR_130660.1 Hanseniaspora uvarum CBS 314 ITS region; from TYPE material

to

>NR_130660.1 Hanseniaspora uvarum

ADD REPLY • link updated 21 months ago by GenoMax 147k • written 21 months ago by fastamasterfromnow • 0

0

Entering edit mode

https://stackoverflow.com/q/75623673/680068

ADD REPLY • link 21 months ago by zx8754 12k

score 6 · Answer 1 · 2023-03-05

It may be easier to get the organism via the accession by entrez-direct.

# conda install -c bioconda entrez-direct
esearch -db nucleotide -query 'NR_171752.1' \
    | esummary \
    | xtract -pattern DocumentSummary -element Organism
Melanconis marginalis subsp. europaea

The solution:

# conda install -c bioconda -c conda-forge entrez-direct seqkit parallel

$ seqkit seq -n -i seqs.fasta \
    | parallel -j 12 --line-buffer\
        'org=$(esearch -db nucleotide -query {} \
            | esummary \
            | xtract -pattern DocumentSummary -element Organism); \
        echo -e "{0}\t$org"' \
  | tee acc2org.tsv

NR_131850.1     Cortinarius timiskamingensis
NR_171752.1     Melanconis marginalis subsp. europaea
NR_130660.1     Hanseniaspora uvarum

$ seqkit seq -i seqs.fasta \
    | seqkit replace -k acc2org.tsv -p '(.+)' -r '$1 {kv}' \
    | seqkit replace -p '\s+' -r _

>NR_130660.1_Hanseniaspora_uvarum
AAGGATCATTAGATTGAATTATCATTGTTGCTCGAGTTCTTGTTTAGATCTTTTACAATA
ATGTGTATCT
>NR_131850.1_Cortinarius_timiskamingensis
GGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTATTGAAA
TAAACCTGAT
>NR_171752.1_Melanconis_marginalis_subsp._europaea
AACGACCACCCAGGGCCGGAAACTTCTCCAAACTCGATCATTTAGAGGAAGTAAAAGTCG
TAACAAGGTC

score 1 · Answer 2 · 2023-03-04

As already pointed out by 5heiki, the problem is to define the changes you want as a sequence of distinct conditions. If the remainder, which you want to truncate, would always start with either CBS or NBM, once could use this sequence to detect the part that should be dropped.

I have now opted for a different approach that might or might not work, depending on the sequence identifiers:

Check if the sequence identifier contains subsp. In that case, retain five fields and print them with underscores. An example of this would be "Melanconis marginalis subsp. europaea"
Check if the line contains ">NR". It is a sequence identifier, but not of a subspecies (For a subsp, only the previous condition is tested, since next will proceed to the next line in the file). Print fields 1-3 separated by underscores.
Neither condition was fulfilled (empty line or DNA sequence): Print it unmodified with $0.

awk '/subsp/{print $1"_"$2"_"$3"_"$4"_"$5;next};/>NR/{print $1"_"$2"_"$3;next};{print $0}' yourfile.fasta

I hope this gives you something that you can modify to match your file precisely.