Question

edit headers of fasta files

6.2 years ago

erick_rc93 ▴ 30

I have a directory with fasta files and these files have headers like this

First, I wanted only the ID (WP_07039397531, for example) from each file, which I extracted with the following command:

for file in *.fna; do cut -d '|' -f1 "$file" | grep ">" | sed 's/ID/ /g' | sed 's/[:>]//g' > "${file/.fna/_ids.txt}"; done

This gives me a list like the following. I would like to replace the last digit before ".1" with "[0-9]":

WP_012167065.1 
 WP_015214247.1 
 WP_015083735.1 
 WP_035159822.1 
 WP_096595623.1 
 WP_096613742.1 
 WP_096613838.1 
 WP_096694933.1 
 WP_015201116.1 
 WP_015173923.1 
 ADB95635.1

The desired output is the following list_ids.txt:

 WP_01216706[0-9].1 
 WP_01521424[0-9].1 
 WP_01508373[0-9].1 
 WP_03515982[0-9].1 
 WP_09659562[0-9].1 
 WP_09661374[0-9].1

Then I want to run grep with the following command:

for file in *.gbk; do  cat list_ids.txt | while read line; do grep -B 2  "$line" "$file"; done ; done

I hope you can help me.

sequence • 2.1k views

ADD COMMENT • link updated 6.2 years ago by sacha ★ 2.4k • written 6.2 years ago by erick_rc93 ▴ 30

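Just add another sed command to the end of your first long pipe, something like s/\./[0-9]./?

The period in the pattern needs a backslash escape, because an unescaped . matches any character in sed. The square brackets need no escaping in the replacement part, where they are literal.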
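
As a minimal sketch of this suggestion, the two possible readings of the request can be compared on one sample ID: inserting [0-9] before the period, or replacing the last digit before the period as in the desired list. The function names insert_class and replace_last_digit are hypothetical and exist only for this illustration.

```shell
# Hypothetical helper: insert the literal text [0-9] before the first period.
# The period is escaped in the pattern; the brackets are literal in the replacement.
insert_class() {
  sed 's/\./[0-9]./'
}

# Hypothetical helper: replace the last digit before the first period with [0-9].
replace_last_digit() {
  sed 's/[0-9]\./[0-9]./'
}

printf '%s\n' 'WP_012167065.1' | insert_class        # WP_012167065[0-9].1
printf '%s\n' 'WP_012167065.1' | replace_last_digit  # WP_01216706[0-9].1
```

Either command could be appended as one more stage of the pipe in the question; only the second reproduces the list_ids.txt shown there.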

ADD REPLY • link 6.2 years ago by Joe 21k

output:

$ sed '/>/ s/\..\s|\s.*//1' test.fa
> ID:WP_070393975
atgc
> ID:WP_070393975
tagc

input:

$ cat test.fa
> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.
atgc
> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.
tagc

ADD REPLY • link 6.2 years ago by cpad0112 21k

That output is not what the OP is looking for, cpad. All it needs is the string '[0-9]' inserted before the period.

ADD REPLY • link 6.2 years ago by Joe 21k

jrj.healey You are right. Amended code below:

$ sed -n '/>/ s/>\s//g;s/.\(.\{2\}\)\s| .*/[0-9]\1/1p' test.fa 

ID:WP_07039397[0-9].1
ID:WP_07039397[0-9].1

input remains the same as OP above.
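
To use such a line with the grep step from the question, the "ID:" prefix would probably have to go, and escaping the period keeps it from matching any character. A rough sketch, assuming headers shaped like the input above; the helper name headers_to_patterns and the three sample lines fed to grep are made up for illustration.

```shell
# Hypothetical helper: turn FASTA headers on stdin into grep patterns.
# It drops the "> ID:" prefix, replaces the last digit before the period
# with [0-9], and escapes the period so grep treats it literally.
headers_to_patterns() {
  sed -n '/^>/ { s/^>[[:space:]]*ID://; s/[0-9]\.\([0-9]*\)[[:space:]]*|.*/[0-9]\\.\1/p; }'
}

pattern=$(printf '%s\n' '> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | hypothetical protein' | headers_to_patterns)
printf '%s\n' "$pattern"

# Made-up lines standing in for a .gbk file, to show the match with two lines of context.
printf '%s\n' 'line a' 'line b' '/protein_id="WP_070393971.1"' | grep -B 2 "$pattern"
```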

ADD REPLY • link 6.2 years ago by cpad0112 21k

score 0 · Answer 1 · 2018-09-05

6.2 years ago

sacha ★ 2.4k

I use seqkit for FASTA manipulation.

You can select and rewrite FASTA headers with seqkit by combining its grep and replace commands, using a regular expression with capture groups. Something like this:

 seqkit grep -nr -p  "WP_\d+\.\d" test.fa|seqkit replace -p ".+(WP_\d+)\.(\d).+" -r '$1[0-9].$2'

output :

 >WP_070393975[0-9].1
 ACGTAA

seqkit grep -nr -p "WP_\d+\.\d" test.fa => keep only the records whose header matches WP_xxxxx.x
seqkit replace -p ".+(WP_\d+)\.(\d).+" -r '$1[0-9].$2' => capture (WP_xxxx) and (x), then replace the header with $1[0-9].$2
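
For comparison, the same capture-and-replace idea can be sketched with sed where seqkit is not installed. This is an approximation, not the seqkit behavior itself: sed has no \d, so [0-9] is used instead, and the function name rewrite_header is hypothetical.

```shell
# Hypothetical helper: on header lines only, capture the WP_ accession and the
# version number, then rebuild the header as >ACCESSION[0-9].VERSION.
rewrite_header() {
  sed -E '/^>/ s/.*(WP_[0-9]+)\.([0-9]+).*/>\1[0-9].\2/'
}

printf '%s\n' '> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | hypothetical protein' 'atgc' | rewrite_header
# >WP_070393975[0-9].1
# atgc
```

Unlike the seqkit grep step, this does not drop records without a WP_ accession; their headers pass through unchanged.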

ADD COMMENT • link 6.2 years ago by sacha ★ 2.4k

sacha : Is this post incomplete?

ADD REPLY • link 6.2 years ago by GenoMax 147k

It was a mistake. I fixed it, sorry.

ADD REPLY • link 6.2 years ago by sacha ★ 2.4k