Remove part of the header from multi-fasta file (another one)
5.0 years ago

Hi!!!

I have a multifasta file with headers like:

>trnN-GUU_INIA601-ARAGORN_v1.2.38 ccsA_INIA601-blatX
>rpl16_INIA601-blatX ndhF_INIA601-blatX psbJ_INIA601-blatX
>trnW-CCA-I_INIA601-ARAGORN_v1.2.38 trnL-UAG_INIA601-ARAGORN_v1.2.38
>psaC_INIA601-blatX trnR-UCU_INIA601-ARAGORN_v1.2.38 ndhA_INIA601-blatX
>trnC-ACA_INIA601-ARAGORN_v1.2.38 trnW-CCA-II_INIA601-ARAGORN_v1.2.38

I would like some way to only leave the name of the gene, like:

>rpl16 
>trnW 
>psaC 
>trnC

Thank you so much for your kind help :)

gene sequence fasta • 888 views

with seqkit:

$ seqkit replace -p "[-_].*" -r "" input.fa

Check whether it makes sense to remove "_INIA601" and everything after it from the FASTA headers.

5.0 years ago
zubenel ▴ 120

Looking at the file, I assume that the tRNA gene names include the anticodon sequence, e.g. "trnN-GUU", "trnW-CCA-I", "trnC-ACA". If you drop the anticodon you lose information and can no longer distinguish cases such as "trnR-UCG" and "trnR-CCG". So if you need the full gene names, you can use:

perl -pe 's/_.*//g' multifasta_file

This regular expression matches everything from the first _ to the end of the line and replaces it with nothing.

Otherwise, if you want the result exactly as you wrote it, you can use:

perl -pe 's/[_-].*//g' multifasta_file

This expression removes everything from the first _ or - to the end of the line. Perl quantifiers are greedy by default, so .* matches as much as it can, which here is the rest of the line.
