Hello,
I am creating a custom TE library and need fasta file headers to be in a specific format.
If I have file 1 with headers like this:
>L2-10_EL__1_000087d4-94a9-4af9-a82b-db9caeebb418--3803-3889 LINE/L2__frg=1__len=87_st=C_div=21.6_sp=idaho.fa
AAGTGACGTTCTCAGCAATCTTGGAGATGTTGTAAGGTCCTAGAAGGGCAGTTTCAGTGCACGTGTTTGGCTCTGAACCCCGACTGG
and file 2 (a text file) with just the simplified names:
>000087d4-94a9-4af9-a82b-db9caeebb418#LINE/L2
>000087d4-94a9-4af9-a82b-db9caeebb418#Unknown
>000087d4-94a9-4af9-a82b-db9caeebb418#LINE/L1
>000087d4-94a9-4af9-a82b-db9caeebb418#LINE/L2
My sequences are each on a single line, so I want to replace each header line of the FASTA file with the matching simplified name and keep the sequence line that follows it. Note that some of the sequence names are duplicated because the TEs come from different parts of the same initial read, so I'm also unsure how to make sure that all copies are included in the output. Hopefully that makes sense? I'm worried that this isn't possible because of the odd format of the FASTA headers.
Thank you in advance!!!
Please do not delete threads once they have received a comment/answer.
You have duplicate lines in file 2. Please clean up the example files and post the output you expect. There are tools that can change the headers of a FASTA file using names from a different file (seqtk, seqkit, etc.). Unless you post workable input(s) and the expected output, it will be difficult to address the issue. However, if you are looking for a pattern in the input header, try this:
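As a rough Python sketch of the pattern idea, assuming every header looks like the one example posted (a UUID-style read ID somewhere in the first field, then whitespace, then the TE class followed by `__`). The names `simplify_header` and `simplify_fasta` are made up for illustration, and the regular expression is a guess based on that single header, so check it against the real file before relying on it. Headers that do not match are left unchanged.

```python
import re
import sys

# Assumption: the read ID is a standard 8-4-4-4-12 hex UUID, as in the example header.
UUID = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"

# Assumption: the TE class (e.g. LINE/L2, Unknown) is the text after the first
# whitespace and before the first double underscore.
HEADER_RE = re.compile(r"^>.*?(" + UUID + r")\S*\s+(\S+?)__")


def simplify_header(line):
    """Return '>UUID#class' for a matching header, else the header unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        return line.rstrip("\n")
    return ">{}#{}".format(match.group(1), match.group(2))


def simplify_fasta(lines):
    """Rewrite header lines one by one; sequence lines pass through untouched."""
    for line in lines:
        if line.startswith(">"):
            yield simplify_header(line)
        else:
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python simplify_headers.py file1.fa > renamed.fa
    with open(sys.argv[1]) as handle:
        for out_line in simplify_fasta(handle):
            print(out_line)
```

Because each record is rewritten where it stands, every copy is kept along with its own sequence line, and file 2 is not needed. The catch is that fragments from the same read and class end up with identical names, which some downstream tools may reject.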