Hi all,
I was wondering if anyone could help me with renaming the header lines on my FASTA files?
I am using a simulated paired end metagenome. The data currently looks like:
>r1.1 |SOURCES={KEY=bf97e692...,bw,559392-559472}|ERRORS={}|SOURCE_1="CP007128.1 Gemmatimonadetes bacterium KBS708, complete genome" (bf97e6923cd410b05af0dc7641aa6e2651e19392)
GTCGCTGCAGGGGCGCGACTCGGCGCGCGTGCGCGACTCGGCGCGCGTGCGCGACTTCGCGCTCT
ACGGCGAGACGACGG
>r1.2 |SOURCES={KEY=bf97e692...,fw,558357-558437}|ERRORS={}|SOURCE_1="CP007128.1 Gemmatimonadetes bacterium KBS708, complete genome" (bf97e6923cd410b05af0dc7641aa6e2651e19392)
GAGGGCGGCTTCCACCCCGGCACCGGCCTGGCCGCCGATCGCCTCGTCGGCATGACGAAGCTCGC
CGGCGAGTGCCGTAC
>r2.1 |SOURCES={KEY=bf97e692...,bw,4893168-4893248}|ERRORS={}|SOURCE_1="CP007128.1 Gemmatimonadetes bacterium KBS708, complete genome" (bf97e6923cd410b05af0dc7641aa6e2651e19392)
TGGAACAGCTCGTCGCGGGCTTCCTCGTAGGGCGTCGGGGTCGCGACAGCATCCCGTCGTCCGCG
GTTGTTATTGCCGTG
>r2.2 |SOURCES={KEY=bf97e692...,fw,4892115-4892195}|ERRORS={76:T}|SOURCE_1="CP007128.1 Gemmatimonadetes bacterium KBS708, complete genome" (bf97e6923cd410b05af0dc7641aa6e2651e19392)
ACTAGATTGACGACGAA*
Here >r1.1 and >r1.2 are a pair of paired-end reads. In order to run MIRA, I need to rename the header lines into a more conventional format, i.e. >r1/1 and >r1/2.
Does anyone know how I can do this for every record in my FASTA files, so that the reads are numbered 1/1, 1/2, 2/1, 2/2, 3/1, 3/2 and so on?
Thanks for your help!
Maisie
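For reference, here is a minimal Python sketch of the renaming asked about above. It assumes every header starts with ">r<number>.<1 or 2>" exactly as in the sample, and that the "r" prefix should be kept (so ">r1.1 ..." becomes ">r1/1"). The function names, the keep_description option, and the file names in the usage note are made up for illustration; they are not part of MIRA or of any existing tool.

```python
import re

# Matches headers such as ">r1.1 |SOURCES=..." and captures the read
# number and the mate number (1 or 2). This pattern is an assumption
# based on the sample headers in the question.
HEADER_RE = re.compile(r"^>r(\d+)\.([12])\b")


def rename_header(line, keep_description=False):
    """Turn '>r1.1 ...' into '>r1/1'; return any other line unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        # Sequence lines and headers in another format pass through as-is.
        return line
    new_id = ">r{}/{}".format(match.group(1), match.group(2))
    if keep_description:
        # Keep everything after the read ID (the |SOURCES=... part).
        return new_id + line[match.end():]
    # Drop the description but preserve the trailing newline, if any.
    return new_id + ("\n" if line.endswith("\n") else "")


def rename_fasta(in_path, out_path, keep_description=False):
    """Stream a FASTA file line by line and write the renamed copy."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rename_header(line, keep_description))
```

Calling rename_fasta("reads.fasta", "reads_renamed.fasta") (both file names are placeholders) would write a copy with the short headers. Passing keep_description=True keeps the |SOURCES=... text after the new ID, in case it is still needed downstream.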
I guess you can do this easily with sed. Do you want to delete the remaining parts? e.g.