edit headers of fasta files
6.2 years ago
erick_rc93 ▴ 30

I have a directory of FASTA files whose headers look like this:

> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.

First I wanted only the ID (WP_070393975.1, for example) from each file, which I extracted with the following command:

for file in *.fna; do cut -d '|' -f1 "$file" | grep ">" | sed 's/ID/ /g' | sed 's/[:>]//g' > "${file/.fna/_ids.txt}"; done

This gives me a list like the following. I would like to replace the last digit before ".1" with the literal text "[0-9]":

WP_012167065.1 
 WP_015214247.1 
 WP_015083735.1 
 WP_035159822.1 
 WP_096595623.1 
 WP_096613742.1 
 WP_096613838.1 
 WP_096694933.1 
 WP_015201116.1 
 WP_015173923.1 
 ADB95635.1

The desired output is the following list_ids.txt:

 WP_01216706[0-9].1 
 WP_01521424[0-9].1 
 WP_01508373[0-9].1 
 WP_03515982[0-9].1 
 WP_09659562[0-9].1 
 WP_09661374[0-9].1

Then I want to run grep with the following command:

for file in *.gbk; do  cat list_ids.txt | while read line; do grep -B 2  "$line" "$file"; done ; done
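
A possible variation on the loop above, offered only as a sketch: grep can read all patterns from a file with -f, so each GenBank file is scanned once instead of once per ID. The function name search_ids and the ".clean" scratch file are hypothetical, and the sketch assumes list_ids.txt holds one pattern per line, possibly padded with spaces as in the list above.

```shell
# Sketch only: search the given files using every pattern in one grep call.
# search_ids and the "<ids_file>.clean" scratch file are hypothetical names.
search_ids() {
    ids_file=$1
    shift
    # Strip surrounding whitespace and drop empty lines; an empty pattern
    # would otherwise match every line of the input.
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//; /^$/d' "$ids_file" > "$ids_file.clean"
    # -f reads one pattern per line; -B 2 prints two lines of leading context.
    grep -B 2 -f "$ids_file.clean" "$@"
}

# Usage (not run here): search_ids list_ids.txt *.gbk
```

When several files are passed, grep prefixes each output line with the file name, which the original per-file loop does not do. Note also that the unescaped period in each pattern still matches any character, exactly as it does in the loop.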

I hope you can help me.

sequence • 2.1k views

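Just add another sed command to your first long pipe, something like s/[0-9]\./[0-9]./?

The period in the pattern needs a backslash, because an unescaped . matches any character. The square brackets are only special in the pattern; in the replacement they are literal and need no escaping.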
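
As a minimal sketch of that extra sed step (the function name make_pattern is hypothetical, and the IDs are taken from the list in the question), the last digit before the version suffix can be captured around and swapped for the literal text [0-9]:

```shell
# Hypothetical helper: replace the last digit before the ".N" version suffix
# with the literal text "[0-9]".
make_pattern() {
    # [0-9] in the pattern matches one digit; \(\.[0-9]\) captures the escaped
    # period plus the first version digit so \1 can put them back unchanged.
    sed 's/[0-9]\(\.[0-9]\)/[0-9]\1/'
}

printf '%s\n' 'WP_012167065.1' 'ADB95635.1' | make_pattern
# prints:
# WP_01216706[0-9].1
# ADB9563[0-9].1
```

This could be appended to the end of the original pipe in place of the function call. The period left in the output is unescaped, matching the list the question asks for, so grep will treat it as "any character".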


output:

$ sed '/>/ s/\..\s|\s.*//1' test.fa
> ID:WP_070393975
atgc
> ID:WP_070393975
tagc

input:

$ cat test.fa
> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.
atgc
> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.
tagc

That output is not what the OP is looking for, cpad. All that's needed is the string '[0-9]' in place of the last digit before the period.


jrj.healey You are right. Amended code below:

$ sed -n '/>/ s/>\s//g;s/.\(.\{2\}\)\s| .*/[0-9]\1/1p' test.fa 

ID:WP_07039397[0-9].1
ID:WP_07039397[0-9].1

input remains the same as OP above.

6.2 years ago
sacha ★ 2.4k

I use seqkit for fasta manipulation

Try selecting and rewriting the FASTA headers with seqkit, using its grep and replace commands with a regular expression and capture groups. Something like this:

 seqkit grep -nr -p  "WP_\d+\.\d" test.fa|seqkit replace -p ".+(WP_\d+)\.(\d).+" -r '$1[0-9].$2'

output :

 >WP_070393975[0-9].1
 ACGTAA
  • seqkit grep -nr -p "WP_\d+\.\d" test.fa => filter the FASTA records by WP_xxxxx.x
  • seqkit replace -p ".+(WP_\d+)\.(\d).+" -r '$1[0-9].$2' => capture (WP_xxxx).(x) and replace with $1[0-9].$2

sacha : Is this post incomplete?


It is a mistake. I fixed it. Sorry
