Renaming fasta files with their headers
11 months ago
sebabiokr ▴ 10

Hi

I have around 85 gene sequences in individual FASTA files. I'd like to rename each one using the gene name given in the [gene=...] tag of its header. From each header, I only want the text between the brackets that follows gene=. I'm trying to do this with Linux commands.

Input FASTA file:

>lcl|NC_018552.1_cds_YP_006666009.1_1 [gene=rps12] [locus_tag=C329_pgp044] [db_xref=GeneID:13540299] [protein=ribosomal protein S12] [exception=trans-splicing] [protein_id=YP_006666009.1] [location=complement(join(100912..100937,101474..101705,72928..73041))] [gbkey=CDS]
ATGCCAACTATTAAACAACTTATTAGAAATACAAGACAGCCAATCAGAAATGTCACGAAATCCCCCGCTC
>lcl|NC_018552.1_cds_YP_006666010.1_2 [gene=psbA] [locus_tag=C329_pgp089] [db_xref=GeneID:13540179] [protein=photosystem II protein D1] [protein_id=YP_006666010.1] [location=complement(565..1626)] [gbkey=CDS]
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACTAGCA

Desired FASTA output:

>rps12
ATGCCAACTATTAAACAACTTATTAGAAATACAAGACAGCCAATCAGAAATGTCACGAAATCCCCCGCTC
>psbA
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGGATAACTAGCA

Can anyone help with this?

TIA

Linux fasta • 1.2k views

This type of question is among the most frequently asked on this forum. Searching through previous posts should give you several different ways of doing this.


Yes, there are many answers on this forum, but they are specific to particular header formats. They do not work with my header line, and it is difficult for me to adapt the commands myself. TIA


I tried this command:

awk -F 'gene=|]|[.]{1}' '/^>/ {print $2}' 7seqNC_018552.1cds.fasta > 7NC_018552.fasta

The output contains only the gene names, without the sequences. Any help appreciated.

11 months ago

I've got a general method, but it requires some basic knowledge of regular expressions. If you don't have that yet, it is worth spending 30 minutes learning them.

After that, you can easily understand what gene=(.+?)\] means.

$ seqkit seq -i --id-regexp 'gene=(.+?)\]' seqs.fasta 
>rps12
ATGCCAACTATTAAACAACTTATTAGAAATACAAGACAGCCAATCAGAAATGTCACGAAA
TCCCCCGCTC
>psbA
ATGACTGCAATTTTAGAGAGACGCGAAAGCGAAAGCCTATGGGGTCGCTTCTGTAACTGG
ATAACTAGCA
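If seqkit is not available, roughly the same header rewrite can be sketched with plain sed. This is only a sketch under a few assumptions: the file names demo_seqs.fasta and demo_renamed.fasta are made up for the demo, and the two headers are trimmed versions of the ones in the question. POSIX sed has no lazy quantifier like .+?, so [^]]* (any run of characters other than a closing bracket) stands in for it.

```shell
# Hypothetical demo input: the two records from the question,
# with headers and sequences trimmed for brevity.
cat > demo_seqs.fasta <<'EOF'
>lcl|NC_018552.1_cds_YP_006666009.1_1 [gene=rps12] [locus_tag=C329_pgp044] [gbkey=CDS]
ATGCCAACTATTAAACAACTT
>lcl|NC_018552.1_cds_YP_006666010.1_2 [gene=psbA] [locus_tag=C329_pgp089] [gbkey=CDS]
ATGACTGCAATTTTAGAGAGA
EOF

# On header lines only (/^>/), keep just the text between "[gene=" and the
# next "]". Sequence lines, and headers without a [gene=...] tag, pass
# through unchanged.
sed '/^>/ s/^>.*\[gene=\([^]]*\)].*/>\1/' demo_seqs.fasta > demo_renamed.fasta

cat demo_renamed.fasta
```

Unlike the seqkit output above, sed leaves the sequence lines exactly as they were in the input and does not re-wrap them.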

Yes, it works perfectly. It would be useful if you could also share a concat command for cases where the same gene occurs more than once in the FASTA file. Thank you.

11 months ago
liorglic ★ 1.5k

Assuming that you have a directory containing multiple files with a .fasta extension, each containing one record, you can do something like:
for f in *.fasta; do name=$(head -n 1 "$f" | sed 's/.*\[gene=\([a-zA-Z0-9_]*\).*/\1/'); mv -- "$f" "$name.fasta"; done
This can also be done easily with a short Python script, so you might consider that as well.
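A slightly more defensive version of the renaming loop might look like the sketch below. It assumes one record per file, as stated above; the demo_rename directory and the seq1/seq2 file names are invented for the demo. It skips any file whose first line has no [gene=...] tag and refuses to overwrite an existing file, which matters if two files carry the same gene name.

```shell
# Hypothetical demo directory with two single-record files;
# the second header deliberately has no [gene=...] tag.
mkdir -p demo_rename
printf '>lcl|x_1 [gene=rps12] [gbkey=CDS]\nATGCCAACT\n' > demo_rename/seq1.fasta
printf '>lcl|x_2 [locus_tag=C329_pgp089] [gbkey=CDS]\nATGACTGCA\n' > demo_rename/seq2.fasta

for f in demo_rename/*.fasta; do
    # Print the gene name only if the first line carries a [gene=...] tag;
    # otherwise sed -n prints nothing and $name stays empty.
    name=$(sed -n '1s/.*\[gene=\([^]]*\)].*/\1/p' "$f")
    if [ -z "$name" ]; then
        echo "skipping $f: no [gene=...] tag in header" >&2
        continue
    fi
    target="demo_rename/$name.fasta"
    # Do not clobber a file that already has this gene name.
    if [ -e "$target" ]; then
        echo "skipping $f: $target already exists" >&2
        continue
    fi
    mv -- "$f" "$target"
done
```

After the loop, seq1.fasta has been renamed to rps12.fasta, while seq2.fasta is left alone and reported on stderr.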


Thank you for your code, but I have a single FASTA file whose headers contain IDs such as [gene=rps12], and I want to replace each header with the gene ID alone, e.g. >rps12 followed by its sequence.


It's hard to understand what your input and desired output look like. Please edit your original question so it includes a clear explanation. Give an example of a real header and what the output file should look like.
