I want to delete the part of each header from just after ">" up to and including "cds_", as well as the suffix after the accession ("_1" and "_2" in the example). In the original file this counter runs up to 784, so the last suffix is "_784". Can someone help me with a solution? That would be great.
From this
>lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]
ATGACTG
>lcl|NC_002721.2_cds_NP_268453.1_2 [gene=dnaGC]
ATGTTCG
To
>NP_268424.1 [gene=dnaZA]
ATGACTG
>NP_268453.1 [gene=dnaGC]
ATGTTCG
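One way to do this, sketched in Python rather than as a one-liner: the regular expression below assumes every header looks like the two in the example, i.e. some prefix ending in "_cds_", then the accession, then an underscore and a running number, optionally followed by a space and a description. The names `HEADER_RE`, `clean_header` and `clean_fasta` are made up for this illustration, and headers that do not match the pattern are passed through unchanged.

```python
import re

# Assumed header layout: ">" + prefix + "_cds_" + accession + "_" + counter + optional description
# group 1: accession (e.g. NP_268424.1); group 2: the rest of the line (e.g. " [gene=dnaZA]")
HEADER_RE = re.compile(r"^>.*?_cds_(\S+?)_\d+(\s.*)?$")


def clean_header(line):
    """Return the shortened header, or the line unchanged if it does not match."""
    line = line.rstrip("\n")
    match = HEADER_RE.match(line)
    if match is None:
        # Sequence lines and unexpected headers are left as they are.
        return line
    return ">" + match.group(1) + (match.group(2) or "")


def clean_fasta(lines):
    """Apply clean_header to every line of a FASTA file given as an iterable of strings."""
    return [clean_header(line) for line in lines]


example = [
    ">lcl|NC_002712.2_cds_NP_268424.1_1 [gene=dnaZA]",
    "ATGACTG",
    ">lcl|NC_002721.2_cds_NP_268453.1_2 [gene=dnaGC]",
    "ATGTTCG",
]
print("\n".join(clean_fasta(example)))
```

For a real file you could pass an open file handle to `clean_fasta` and write the result to a new file. The counter is matched with `\d+`, so "_784" is handled the same way as "_1".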
Spaces in FASTA headers may look visually appealing, but keep in mind that if you use this file for alignments etc., most aligners will drop all text after the first space in a FASTA header. So you will lose the gene name.
Thanks for the heads up! I downloaded this from GenBank. I will include tr " " "_" in my unix command.
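For reference, the same space-to-underscore replacement can be sketched in Python so that it only touches header lines. The function name `underscore_header` is hypothetical, and the sketch assumes that header lines are exactly the lines starting with ">".

```python
def underscore_header(line):
    """Replace spaces with underscores in FASTA header lines only."""
    line = line.rstrip("\n")
    if line.startswith(">"):
        return line.replace(" ", "_")
    # Sequence lines are returned unchanged.
    return line


print(underscore_header(">NP_268424.1 [gene=dnaZA]"))
print(underscore_header("ATGACTG"))
```

Restricting the replacement to header lines is a cautious choice: it leaves sequence lines alone even if they happen to contain stray spaces.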
Along the same lines as @genomax, I would say that brackets, pipes and equals signs are also best avoided. Not because they will hurt you now, but because you never know what you'll need these data for in the future. Those characters could easily mess up your future pipelines!
Something like:
Would probably be the best way to ensure no problems in the future while retaining all the necessary info.
I want to remove the name between FASTA files from the command line and merge them together. For example:
as:
How can I do this using sed or another tool? Any help is appreciated! Thanks
This is a different question, so please post it as such and not as a comment underneath another question. Also, please check whether your desired output is exactly what you want (e.g. it contains whitespace that I believe you don't want in there).