Question

Replace headers of fasta file with simplified headers from another file

4.1 years ago

Ava • 0

Hello,

I am creating a custom TE library and need fasta file headers to be in a specific format.

If I have file 1 with headers like such:

>L2-10_EL__1_000087d4-94a9-4af9-a82b-db9caeebb418--3803-3889   LINE/L2__frg=1__len=87_st=C_div=21.6_sp=idaho.fa
AAGTGACGTTCTCAGCAATCTTGGAGATGTTGTAAGGTCCTAGAAGGGCAGTTTCAGTGCACGTGTTTGGCTCTGAACCCCGACTGG

and file 2 (a text file) with just the simplified names:

>000087d4-94a9-4af9-a82b-db9caeebb418#LINE/L2
>000087d4-94a9-4af9-a82b-db9caeebb418#Unknown
>000087d4-94a9-4af9-a82b-db9caeebb418#LINE/L1
>000087d4-94a9-4af9-a82b-db9caeebb418#LINE/L2

My actual sequences are each on a single line, so I want to replace each header line of the fasta file with the matching simplified header and keep the sequence line that follows. Note that some of the sequence names are duplicated because the TEs come from different parts of the same initial read, so I'm also unsure how to make sure that all copies are included in the output. Hopefully that makes sense? I'm worried that this isn't possible due to the odd format of the fasta file headers.

Thank you in advance!!!

python fasta bash • 1.9k views

updated 4.1 years ago by cpad0112 21k • written 4.1 years ago by Ava • 0

Please do not delete threads once they have received a comment/answer.

4.1 years ago by GenoMax 151k

You have duplicate lines in file 2. Please clean up the example files and post the output you expect. There are tools that can replace the headers of a fasta file using names from a different file (seqtk, seqkit, etc.). Unless you post workable input(s) and the expected output, it will be difficult to address the issue. However, if you are looking for a pattern in the input header, try this:

$ sed -r '/^>/ s/.*__[0-9]+_(.*)--.*/>\1/g' test.fa

>000087d4-94a9-4af9-a82b-db9caeebb418
AAGTGACGTTCTCAGCAATCTTGGAGATGTTGTAAGGTCCTAGAAGGGCAGTTTCAGTGCACGTGTTTGGCTCTGAACCCCGACTGG
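If you would rather stay in Python (the question is tagged python), the same idea can be sketched with the standard re module. This is only a rough equivalent of the sed command above, not a tested replacement for it: the function name simplify_header is made up for illustration, and it assumes every header has the "__<digits>_<id>--" layout of the single example posted.

```python
import re

# Same pattern as the sed command: drop everything up to "__<digits>_"
# and keep the text that sits before the following "--".
# Assumption: all headers follow the layout of the one example in the question.
HEADER_RE = re.compile(r".*__[0-9]+_(.*)--.*")


def simplify_header(line: str) -> str:
    """Return a simplified header; sequence lines pass through unchanged."""
    if not line.startswith(">"):
        return line
    # "." does not match the newline, so a trailing "\n" is preserved.
    return HEADER_RE.sub(r">\1", line, count=1)
```

To use it, loop over the lines of the fasta file and write out simplify_header(line) for each one. Headers that do not match the pattern are left as they are, so they are easy to spot afterwards.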

4.1 years ago by cpad0112 21k

score 0 · Answer 1 · 2021-04-02

You might do better just modifying the headers with a regex command. You can also create fields based on the current header conventions and remove the ones that are superfluous.

As a simple example, Trinity assembles contigs which it names hideously.

>TRINITY_DN100_c0_g1_i1 len=242 path=[0:0-241]

If your DNA sequence is on the next line, you can remove the length and path information from the header with:

awk 'BEGIN {FS=" len="} {print $1}' file.fasta

Since the field separator doesn't occur in the DNA sequence, those won't be changed. You can do something like this a step at a time and save intermediate files as you reach your solution.
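For comparison, here is a minimal Python sketch of the same trimming step. The function name trim_header and the default marker " len=" are assumptions chosen to fit the Trinity example above; adjust the marker to whatever separates the part of the header you want to keep.

```python
def trim_header(line: str, marker: str = " len=") -> str:
    """Cut a header line at the first occurrence of marker.

    Lines that do not start with ">" (the sequence lines) are returned
    unchanged. The line is expected without its trailing newline.
    """
    if not line.startswith(">"):
        return line
    # split(marker, 1)[0] keeps everything before the first marker;
    # if the marker is absent, the whole header is kept.
    return line.split(marker, 1)[0]
```

Unlike the awk version, this checks for the leading ">" explicitly, so a sequence line is never touched even if the marker happened to occur in it.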

You can also check out the .id attribute of Biopython's SeqRecord objects, which you get when parsing fasta files. Here is an example of a similar question.

score 0 · Answer 2 · 2021-04-03

From your single example, it looks like you want to keep a fixed pattern from the existing fasta headers, consisting of a fixed combination of numbers and letters plus the tag that comes right after the blank space. This single example may not represent every form the fasta headers can take, but if it does, this simple regex may do:

perl -pe 's/^>.+([a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12})\S+\s+([^_]+).+/>$1#$2/' file.fasta
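The same substitution can be sketched in Python with only the standard library. The regex is carried over from the perl one-liner; the function name rename_headers is hypothetical, and the sketch assumes all headers look like the single example in the question. Because every header is rewritten on its own, records that end up with the same simplified name are all kept in the output, which was one of the concerns in the question.

```python
import re

# Pattern carried over from the perl one-liner:
# 8-4-4-4-12 blocks of lowercase letters/digits, then the tag after the blanks.
UUID = r"[a-z0-9]{8}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{12}"
HEADER_RE = re.compile(rf"^>.+({UUID})\S+\s+([^_]+).+")


def rename_headers(lines):
    """Rewrite header lines as '>id#tag'; sequence lines are kept as they are.

    Returns a list of lines without trailing newlines. Headers that do not
    match the assumed layout are left unchanged.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            line = HEADER_RE.sub(r">\1#\2", line)
        out.append(line)
    return out
```

With the header from the question this gives >000087d4-94a9-4af9-a82b-db9caeebb418#LINE/L2, followed by the untouched sequence line. To process a file, pass the open file handle to rename_headers and write the returned lines back out joined with newlines.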