>NR_130660.1 Hanseniaspora uvarum CBS 314 ITS region; from TYPE material
AAGGATCATTAGATTGAATTATCATTGTTGCTCGAGTTCTTGTTTAGATCTTTTACAATAATGTGTATCT
>NR_131850.1 Cortinarius timiskamingensis NBM D. Malloch 3-9-81/2 ITS region; from TYPE material
GGAAGTAAAAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTATTGAAATAAACCTGAT
>NR_171752.1 Melanconis marginalis subsp. europaea CBS 131692 ITS region; from TYPE material
AACGACCACCCAGGGCCGGAAACTTCTCCAAACTCGATCATTTAGAGGAAGTAAAAGTCGTAACAAGGTC
The problem here is that from e.g.
>NR_130660.1 Hanseniaspora uvarum CBS 314 ITS region; from TYPE material
you want fields 1-3 (space delimited), whereas from e.g.
>NR_171752.1 Melanconis marginalis subsp. europaea CBS 131692 ITS region; from TYPE material
you want fields 1-5.
What kind of rule makes both of these possible? For example, in your file, is it always: accession<space>species name (n space-delimited fields)<space>something starting with a capital letter? Is there never a capital letter in the species field (other than the first letter)? Does a capital letter always follow the species name?
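If the answer to those questions were yes, the rule could be sketched roughly as below. This is only an illustration under that assumption (every field after the genus is kept until one starts with a capital letter); the function name cut_at_capital is made up for the example.

```shell
# Hypothetical helper: reads FASTA on stdin, writes shortened headers to stdout.
# Assumption: the first capitalized field after the genus marks the end of the name.
cut_at_capital() {
  awk '
    /^>/ {
      out = $1 "_" $2                    # accession and genus (genus is capitalized)
      for (i = 3; i <= NF; i++) {
        if ($i ~ /^[A-Z]/) break         # e.g. CBS or NBM: stop here
        out = out "_" $i                 # species, "subsp.", epithet, ...
      }
      print out
      next
    }
    { print }                            # sequence and empty lines stay unchanged
  '
}
```

It would be called as cut_at_capital < input.fasta, where input.fasta is a placeholder file name.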
Thank you for your reply.
I am new to the field, and I am sorry that my question was vague.
I wasn't sure about the data type.
I tried sed 's/ /_/g', and the underscore part is fine now.
Now I just want to keep fields 1-3 (space delimited), as you said.
For example,
from
>NR_130660.1 Hanseniaspora uvarum CBS 314 ITS region; from TYPE material
As already pointed out by 5heiki, the problem is to define the changes you want as a sequence of distinct conditions. If the remainder, which you want to truncate, always started with either CBS or NBM, one could use these strings to detect the part that should be dropped.
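For illustration only, that CBS/NBM idea might look like the sketch below. It assumes that every header contains one of these two collection codes as a separate word; headers without them would keep all their fields. The function name trim_at_collection_code is hypothetical.

```shell
# Hypothetical helper: reads FASTA on stdin, writes shortened headers to stdout.
# Assumption: the part to drop always starts with " CBS " or " NBM ".
trim_at_collection_code() {
  # On header lines only: cut from the collection code onwards,
  # then replace the remaining spaces with underscores.
  sed -E '/^>/ { s/ (CBS|NBM) .*//; s/ /_/g; }'
}
```

It would be called as trim_at_collection_code < input.fasta, with input.fasta standing in for the real file.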
I have now opted for a different approach that might or might not work, depending on the sequence identifiers:
Check whether the sequence identifier contains subsp. In that case, retain five fields and print them joined by underscores. An example of this would be "Melanconis marginalis subsp. europaea".
Check whether the line contains ">NR". If so, it is a sequence identifier, but not of a subspecies (for a subspecies, only the previous condition is tested, since next skips ahead to the next line of the file). Print fields 1-3 joined by underscores.
If neither condition is fulfilled (empty line or DNA sequence), print the line unmodified with $0.
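The three conditions above can be sketched in awk roughly as follows. This is a minimal version wrapped in a shell function whose name (shorten_headers) is made up for the example; it assumes that every header starts with ">NR" and that a subspecies name always occupies exactly fields 2-5.

```shell
# Hypothetical helper: reads FASTA on stdin, writes shortened headers to stdout.
shorten_headers() {
  awk '
    # Condition 1: header of a subspecies, keep five fields.
    /^>NR/ && / subsp\. / { print $1 "_" $2 "_" $3 "_" $4 "_" $5; next }
    # Condition 2: any other header, keep three fields.
    /^>NR/ { print $1 "_" $2 "_" $3; next }
    # Condition 3: sequence or empty line, print unchanged.
    { print $0 }
  '
}
```

It would be called as shorten_headers < input.fasta > renamed.fasta (both file names are placeholders). For the three example headers this gives >NR_130660.1_Hanseniaspora_uvarum, >NR_131850.1_Cortinarius_timiskamingensis and >NR_171752.1_Melanconis_marginalis_subsp._europaea.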
I think the real problem here is that the OP has made no effort and wants us to do all the work.
Sorry if it seemed that way. I am new to bioinformatics and was not sure about the data type. I have now made my question more specific.
https://stackoverflow.com/q/75623673/680068