Hello! I'm building a database of a certain gene family. I downloaded the FASTA files from UniProt and concatenated them using cat.
The header of each sequence has the following format:
> tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1
I'm performing an alignment with MMseqs2, and I need the gene information (the GN= part) to be the first string after the first pipe sign (|) in each FASTA header, in place of the accession that is there now. Is there a way to do that using awk or R string manipulation?
The expected result for each FASTA header is the following:
> tr|GN=nidA3|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 PE=3 SV=1
Thanks for your time.
Check if this works:
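A minimal sketch of one way to do the rewrite with sed, not necessarily the only or best one. It assumes that the GN= field is preceded by a space, that the gene name contains no spaces, and that each header has at most one such field. Headers without a GN= field and sequence lines pass through unchanged. The function name move_gn and the sequence line in the demo are placeholders for illustration.

```sh
#!/bin/sh
# Hypothetical helper: reads FASTA on stdin, writes the rewritten FASTA to stdout.
# On header lines only (those starting with ">"):
#   \1 = everything up to and including the first pipe
#   \2 = the second pipe and the text up to the space before GN=
#   \3 = the GN=<gene> field (assumed to contain no spaces)
#   \4 = whatever follows the GN= field
# The second field (the accession) is dropped and GN=<gene> takes its place.
move_gn() {
  sed -E '/^>/ s/^([^|]*\|)[^|]*(\|.*) (GN=[^ ]+)(.*)$/\1\3\2\4/'
}

# Demo on the header from the question; the sequence line is made up.
printf '%s\n' \
  '>tr|D7RED9|D7RED9_9MYCO NidA3 (Fragment) OS=Mycobacterium sp. py145 OX=767442 GN=nidA3 PE=3 SV=1' \
  'MKTAYIAK' |
  move_gn
```

To run it on the concatenated file, something like `move_gn < all.fasta > all_gn.fasta` should work, where both file names are hypothetical.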
Replace sed with gsed on macOS.