Question

Rename the fasta entries in Unix or R

0

7.2 years ago

horsedog ▴ 60

I'd like to change the header of each entry in my FASTA files

from:

gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome

to:

Escherichia_coli_str._K-12_substr._MG1655

In other words, I want to remove the accession numbers, keep only the species name, and replace every space with an underscore. Either R or Unix is fine.

Thank you very much.

R genome • 3.6k views

ADD COMMENT • link updated 7.2 years ago by cpad0112 21k • written 7.2 years ago by horsedog ▴ 60

2

Always mention what you've tried. Your question suggests that you just want an answer and are not interested in learning how to get there, which is not how anyone should approach this.

ADD REPLY • link 7.2 years ago by Ram 44k

score 1 · Answer 1 · 2017-09-14

1

7.2 years ago

Matteo Schiavinato ★ 3.6k

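I would strongly suggest using bioawk for these operations. It is really handy. Note that bioawk splits the header at the first whitespace into $name and $comment, so the two are rejoined here before splitting on "|":

bioawk -c fastx '{split($name" "$comment, a, "|"); sub(/,.*/, "", a[5]); print ">"a[5]"\n"$seq}' file.fa | tr " " "_"

This should do it. For installation, have a look at "install bioawk in unix system".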

score 1 · Answer 2 · 2017-09-14

1

7.2 years ago

Sej Modha 5.3k

Simple bash solution:

cat file.fa | awk -F'[|,]' '{print $1$5}' | sed -e 's/ /_/g;s/^>gi/>/'

0

awk  -F '[/^>|,]' 'NF>1{gsub(" ","_",$6);print ">"$6} {print $1}'  test1.fa | awk NF

input:

$ cat test1.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT

ADD REPLY • link 7.2 years ago by cpad0112 21k

score 1 · Answer 3 · 2017-09-14

awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}' input.fa

ex:

~$ echo -e '>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome\nATGC' | awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}'
>Escherichia_coli_str._K-12_substr._MG1655
ATGC

score 1 · Answer 4 · 2017-09-14

1

7.2 years ago

Jake Warner ▴ 840

Adding an R solution for people who hate the speed of awk!

library(Biostrings)
library(dplyr)

fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
names(fasta)
##[1] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah FIRST SEQ" 
##[2] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah SECOND SEQ"
##[3] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah THIRD SEQ"

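# Keep the 5th "|"-separated field, drop anything after the first comma
# (e.g. ", complete genome"), then replace spaces with underscores.
names(fasta) <- 
  names(fasta) %>%
  strsplit(., split="|",fixed=TRUE) %>%
  sapply(., '[', 5) %>%
  sub(",.*", "", .) %>%
  gsub(" ", "_",.)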

names(fasta)
##[1] "Escherichia_coli_str._blah_blah_FIRST_SEQ" 
##[2] "Escherichia_coli_str._blah_blah_SECOND_SEQ"
##[3] "Escherichia_coli_str._blah_blah_THIRD_SEQ" 

writeXStringSet(fasta, filepath = 'test_EDITED.fa',format="fasta")

2

for people who hate the speed of awk

dat sarcasm tho :D

ADD REPLY • link 7.2 years ago by Matteo Schiavinato ★ 3.6k

0

Another R solution. test.fa contains two records to show that the script is general and works on FASTA files with multiple sequences:

$ cat test.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
GTCTGG

R code:

library(Biostrings)
library(stringr)
fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
names(fasta)=gsub(" ","_",str_split_fixed(str_split_fixed(names(fasta),"\\|",5)[,5],",",2)[,1])
writeXStringSet(fasta, filepath = 'test_edited.fa',format="fasta")

ADD REPLY • link 7.2 years ago by cpad0112 21k

score 0 · Answer 5 · 2017-09-14

Brain isn't functioning well enough to make one regex out of this, but it's basically just two string removals and a transliteration (whitespace to underscore):

$ echo "gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome" | sed -e 's/.*|//' -e 's/,.*//' | tr ' ' '_'

Yields

Escherichia_coli_str._K-12_substr._MG1655

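If you're dealing with a file, change echo to cat, but note that the first substitution also strips the leading ">" from header lines, so restrict the edits to header lines and keep the ">" in place.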
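
A minimal sketch of that file version, assuming the headers follow the gi|...|description, extra text layout from the question. The function name rename_fasta_headers is made up for illustration, and the sed script is just one way to restrict the edits to header lines:

```shell
# Hypothetical helper; assumes the wanted name sits after the last "|" and
# before the first comma of each header line.
rename_fasta_headers() {
  # Only lines starting with ">" are edited; sequence lines pass through.
  sed '/^>/ {
    s/^>.*|/>/
    s/,.*//
    s/ /_/g
  }' "$@"
}

# Example run on a single record fed through stdin
printf '%s\n' \
  '>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome' \
  'ATGC' | rename_fasta_headers
```

With a file, it can be called as rename_fasta_headers input.fa > renamed.fa. The example run prints >Escherichia_coli_str._K-12_substr._MG1655 followed by ATGC.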

score 0 · Answer 6 · 2017-09-14

0

7.2 years ago

cpad0112 21k

$ cat test1.fa
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT

code and output:

$ sed -re '/>/ s/.*\|(.*),.*/>\1/' -e 's/ /_/g' test1.fa 
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT

To show that the script is general and works on FASTA files with one or more sequences, I copy/pasted the same sequence twice.

0

Close, but you're missing the transliteration from space to underscore the OP wants ;)