Rename the fasta entries in Unix or R
7.2 years ago
horsedog ▴ 60

I'd like to change the header of each entry in my FASTA files

from:

gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome

to:

Escherichia_coli_str._K-12_substr._MG1655

That is, I want to remove the accession numbers and keep only the species name, with all spaces replaced by underscores. Either R or Unix is fine.

Thank you very much.

R genome • 3.6k views

Always mention what you've tried. Your question suggests that you just want an answer and are not interested in learning how to get there, which is not how anyone should approach this.

7.2 years ago

I would strongly suggest using bioawk for these operations. It is really handy.

bioawk -c fastx '{split($name" "$comment, a, "|"); sub(/,.*/, "", a[5]); gsub(/ /, "_", a[5]); print ">"a[5]"\n"$seq}' file.fa

This should do it. Have a look at how to install bioawk on a Unix system.

7.2 years ago
Sej Modha 5.3k

Simple bash solution:

awk -F'[|,]' '/^>/ {gsub(/ /, "_", $5); print ">" $5; next} {print}' file.fa
awk  -F '[/^>|,]' 'NF>1{gsub(" ","_",$6);print ">"$6} {print $1}'  test1.fa | awk NF

input:

$ cat test1.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
7.2 years ago
awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}' input.fa

ex:

~$ echo -e '>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome\nATGC' | awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}'
>Escherichia_coli_str._K-12_substr._MG1655
ATGC
7.2 years ago
Jake Warner ▴ 840

Adding an R solution for people who hate the speed of awk!

library(Biostrings)
library(dplyr)

fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
names(fasta)
##[1] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah FIRST SEQ" 
##[2] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah SECOND SEQ"
##[3] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah THIRD SEQ"

names(fasta) <- 
  names(fasta) %>%
  strsplit(., split="|",fixed=TRUE) %>%
  sapply(., '[', 5) %>%
  gsub(" ", "_",.)

names(fasta)
##[1] "Escherichia_coli_str._blah_blah_FIRST_SEQ" 
##[2] "Escherichia_coli_str._blah_blah_SECOND_SEQ"
##[3] "Escherichia_coli_str._blah_blah_THIRD_SEQ" 

writeXStringSet(fasta, filepath = 'test_EDITED.fa',format="fasta")

for people who hate the speed of awk

dat sarcasm tho :D


Another R solution for test.fa:

test.fa: the entry is copied twice (with different sequences) to show that the script is general and works with FASTA files containing multiple sequences:

$ cat test.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
GTCTGG

R code:

library(Biostrings)
library(stringr)
fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
names(fasta)=gsub(" ","_",str_split_fixed(str_split_fixed(names(fasta),"\\|",5)[,5],",",2)[,1])
writeXStringSet(fasta, filepath = 'test_edited.fa',format="fasta")
7.2 years ago
Joe 21k

Brain isn't functioning well enough to make one regex out of this, but it's basically just 2 string removals and a transliteration (whitespace to underscore).

$ echo "gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome" | sed -e 's/.*|//' -e 's/,.*//' | tr ' ' '_'

Yields

Escherichia_coli_str._K-12_substr._MG1655

If you're dealing with a file, change echo to cat, but note that 's/.*|//' also strips the leading '>' from header lines, so use 's/.*|/>/' there to put it back.
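
For a whole FASTA file, the same two removals and one transliteration can be restricted to header lines so the sequence lines are left alone. A minimal sketch, assuming the headers look like the one in the question; the file names demo.fa and demo_renamed.fa are made up for illustration:

```shell
# Hypothetical demo input (demo.fa is an assumed name, not from the thread).
printf '%s\n' \
  '>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome' \
  'ATGC' > demo.fa

# Each expression only touches lines starting with ">":
#   1. drop everything up to the last "|" and put the ">" back
#   2. drop the first comma and everything after it
#   3. replace every space with an underscore
sed -e '/^>/s/.*|/>/' -e '/^>/s/,.*//' -e '/^>/s/ /_/g' demo.fa > demo_renamed.fa

cat demo_renamed.fa
```

With the demo input this prints the header as >Escherichia_coli_str._K-12_substr._MG1655 followed by the untouched sequence line.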

7.2 years ago
$ cat test.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT

code and output:

$ sed -re '/>/ s/.*\|(.*),.*/>\1/' -e 's/ /_/g' test.fa 
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT

To show that the script is general and works with FASTA files containing one or more sequences, I copy/pasted the same sequence twice.
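
One side effect visible in the output above: once the accessions are gone, entries from the same strain end up with identical headers. A rough sketch for spotting that, assuming the renamed headers contain no spaces; renamed_demo.fa is a made-up file name for illustration:

```shell
# Hypothetical renamed file with two identical headers (assumed name).
printf '%s\n' \
  '>Escherichia_coli_str._K-12_substr._MG1655' 'ATCGT' \
  '>Escherichia_coli_str._K-12_substr._MG1655' 'ATCGT' > renamed_demo.fa

# List every header that occurs more than once, followed by its count.
dups=$(grep '^>' renamed_demo.fa | sort | uniq -c | awk '$1 > 1 {print $2 "\t" $1}')
echo "$dups"
```

If this prints nothing, all headers are unique; otherwise you may want to keep part of the accession in the name.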


Close, but you're missing the transliteration from space to underscore the OP wants ;)


Thanks and updated the code.
