Question

Renaming fasta files with their headers

0

18 months ago

sebabiokr ▴ 10

Hi

I have around 85 gene sequences in individual FASTA files. I'd like to rename each file with its header name, which contains the gene name in [gene=]. For each header, I only want what is between the brackets. I'm trying to do this with Linux commands.

Input FASTA file:

>lcl|NC_018552.1_cds_YP_006666009.1_1 [gene=rps12] [locus_tag=C329_pgp044] [db_xref=GeneID:13540299] [protein=ribosomal protein S12] [exception=trans-splicing] [protein_id=YP_006666009.1] [location=complement(join(100912..100937,101474..101705,72928..73041))] [gbkey=CDS]
ATGCCAACTATTAAACAACTTATTAGAAATACAAGACAGCCAATCAGAAATGTCACGAAATCCCCCGCTC
>lcl|NC_018552.1_cds_YP_006666010.1_2 [gene=psbA] [locus_tag=C329_pgp089] [db_xref=GeneID:13540179] [protein=photosystem II protein D1] [protein_id=YP_006666010.1] [location=complement(565..1626)] [gbkey=CDS]
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACTAGCA

Desired FASTA output:

>rps12
ATGCCAACTATTAAACAACTTATTAGAAATACAAGACAGCCAATCAGAAATGTCACGAAATCCCCCGCTC
>psbA
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACTAGCA

Can anyone help with this?

TIA

Linux fasta • 1.8k views

ADD COMMENT • link updated 18 months ago by Ram 45k • written 18 months ago by sebabiokr ▴ 10

0

This type of question is among the most frequently asked on this forum. Searching through previous posts should give you several different options for doing this task.

ADD REPLY • link 18 months ago by Mensur Dlakic ★ 29k

0

Yes, there are many answers on this forum, but they are specific to a particular header line. They do not work with my header line, and it's difficult for me to adapt the commands myself. TIA

ADD REPLY • link 18 months ago by sebabiokr ▴ 10

0

I tried this command:

awk -F 'gene=|]|[.]{1}' '/^>/ {print $2}' 7seqNC_018552.1cds.fasta > 7NC_018552.fasta

I got only the gene names in the output, without the sequences. Any help appreciated.

ADD REPLY • link 18 months ago by sebabiokr ▴ 10
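The command above has a rule only for header lines (`/^>/`), so sequence lines are never printed. A minimal awk sketch that rewrites the headers and passes everything else through might look like the following. The function name `rename_headers` is made up for illustration, and it assumes every gene name sits in a `[gene=...]` tag with no `]` inside the name.

```shell
#!/usr/bin/env bash
# Sketch: replace each FASTA header with the value of its [gene=...] tag
# and leave sequence lines untouched. rename_headers is a hypothetical name.
rename_headers() {
  awk '
    /^>/ {
      # match() sets RSTART/RLENGTH for the first "[gene=...]" tag
      if (match($0, /\[gene=[^]]+\]/)) {
        # skip the 6 characters of "[gene=" and drop the closing "]"
        print ">" substr($0, RSTART + 6, RLENGTH - 7)
      } else {
        print              # no gene tag: keep the header as it is
      }
      next
    }
    { print }              # sequence lines pass through unchanged
  ' "$@"
}

# Small demo on an inline record; a file name can be passed instead of stdin.
printf '%s\n' '>lcl|NC_018552.1_cds_YP_006666010.1_2 [gene=psbA] [gbkey=CDS]' \
  'ATGACTGCAATTTTAGAGAGACGC' | rename_headers
```

With a real file this would be called as `rename_headers 7seqNC_018552.1cds.fasta > 7NC_018552.fasta`. Headers without a gene tag are kept unchanged rather than dropped, which makes missing tags easy to spot afterwards.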

score 3 · Answer 1 · 2023-12-11

3

18 months ago

shenwei356 8.7k

I've got a general method, but you need to have some basic knowledge of regular expressions. If not, please learn them for 30 minutes:

After that, you can easily understand what gene=(.+?)\] means.

$ seqkit seq -i --id-regexp 'gene=(.+?)\]' seqs.fasta 
>rps12
ATGCCAACTATTAAACAACTTATTAGAAATACAAGACAGCCAATCAGAAATGTCACGAAA
TCCCCCGCTC
>psbA
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGG
ATAACTAGCA

ADD COMMENT • link 18 months ago by shenwei356 8.7k

0

Yes, it's working perfectly. It would be useful if you could also share a concat command for cases where the same gene appears more than once in the FASTA file. Thank you.

ADD REPLY • link 18 months ago by sebabiokr ▴ 10
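One possible reading of the "concat" request is to join the sequences of records that share the same gene name into a single record. A rough awk sketch under that assumption is shown here. It expects headers that are already bare gene names (for example the output of the seqkit command above), and `concat_same_genes` is a made-up name. Whether joining the copies is biologically meaningful depends on the data.

```shell
#!/usr/bin/env bash
# Sketch: merge records with identical headers, keeping first-seen order.
# concat_same_genes is a hypothetical name, not part of any tool.
concat_same_genes() {
  awk '
    /^>/ {
      name = substr($0, 2)                 # header text without ">"
      if (!(name in seq)) {                # first time this gene is seen
        order[++n] = name
        seq[name] = ""
      }
      next
    }
    { seq[name] = seq[name] $0 }           # append sequence line to its gene
    END {
      for (i = 1; i <= n; i++)
        print ">" order[i] "\n" seq[order[i]]
    }
  ' "$@"
}

# Small demo: rps12 occurs twice and ends up as one record.
printf '%s\n' '>rps12' 'ATGCC' '>psbA' 'ATGAC' '>rps12' 'TTTAA' | concat_same_genes
```

Each merged sequence is written on a single line. The pieces are joined in the order they appear in the file, so the input order decides the layout of the concatenated sequence.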

score 0 · Answer 2 · 2023-12-11

0

18 months ago

liorglic ★ 1.5k

Assuming that you have a directory containing multiple files with .fasta extension, each containing one record, you can do something like:
for f in *.fasta; do name=$(head -n 1 "$f" | sed 's/.*\[gene=\([a-zA-Z0-9_]*\)\].*/\1/'); mv "$f" "$name.fasta"; done
Also, this can be easily achieved using a short Python script, so you might consider that as well.

ADD COMMENT • link 18 months ago by liorglic ★ 1.5k
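A slightly more defensive variant of the same idea is sketched below, assuming one record per file and a `[gene=...]` tag in the first line. It skips files that have no gene tag and refuses to overwrite an existing target, which matters if two files carry the same gene. The function name `rename_by_gene` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: rename every *.fasta file in a directory to <gene>.fasta,
# where <gene> comes from the [gene=...] tag of the first header line.
rename_by_gene() {
  local dir=$1 f name
  for f in "$dir"/*.fasta; do
    [ -e "$f" ] || continue                # glob matched nothing
    # -n with the p flag prints only when the substitution succeeded
    name=$(head -n 1 "$f" | sed -n 's/.*\[gene=\([^]]*\)\].*/\1/p')
    if [ -z "$name" ]; then
      echo "skip (no gene tag): $f" >&2
    elif [ -e "$dir/$name.fasta" ]; then
      echo "skip (target exists): $f" >&2
    else
      mv -- "$f" "$dir/$name.fasta"
    fi
  done
}

# Small demo in a throwaway directory.
demo=$(mktemp -d)
printf '%s\n' '>lcl|x_1 [gene=rps12] [gbkey=CDS]' 'ATGCCAACT' > "$demo/seq1.fasta"
rename_by_gene "$demo"
ls "$demo"
rm -rf "$demo"
```

Running it on a copy of the directory first is a cheap way to check the resulting names before touching the original files.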

0

Thank you for your code, but I have a FASTA file whose headers contain IDs like [gene=rps12], and I want to replace each header with the gene ID alone, e.g. >rps12 followed by the sequence atgcgtacg.

ADD REPLY • link 18 months ago by sebabiokr ▴ 10

1

It's hard to understand what your input and desired output look like. Please edit your original question so it includes a clear explanation. Give an example of a real header and what the output file should look like.

ADD REPLY • link 18 months ago by liorglic ★ 1.5k