How can I add the information at the end of the line to the beginning of the line in R
2.9 years ago
logbio ▴ 30

I have a FASTA file. I want to add the information in square brackets at the end of each header line to the beginning of that line, without the brackets.

From:

gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]

To:

Homo_sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]

If each sequence is on a single line and the headers are in exactly the same format:

$ awk -F "[][>]" '/^>/{getline seq}{print ">"$3"_"$2,"["$3"]""\n"seq}' test.fa

>Homo sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2  [Homo sapiens]

with sed:

$ sed -r '/^>/ s/^>(.*)\s\[(.*)\]$/>\2_\1 \[\2\]/g' test.fa

>Homo sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]
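Both commands above keep the space inside the species name (Homo sapiens_...), while the question asks for Homo_sapiens_.... A minimal sketch of one way to get the underscore with sed is shown here. The file names demo.fa and demo_renamed.fa are made up for illustration, and the pattern assumes the header ends with a two-word name in square brackets, with no trailing whitespace or Windows line endings. For names with more than two words, only the first space becomes an underscore.

```shell
#!/bin/sh
# Hypothetical demo file; the names and contents are for illustration only.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fa" <<'EOF'
>gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]
ATGCATGCGTGTGTGTGG
EOF

# Only header lines (starting with ">") are edited.
# \1 = the whole original header, \2 = first word inside the brackets,
# \3 = the rest of the bracketed name.
sed -E '/^>/ s/^>(.*\[([^] ]+) ([^]]+)\])$/>\2_\3_\1/' \
    "$tmpdir/demo.fa" > "$tmpdir/demo_renamed.fa"

cat "$tmpdir/demo_renamed.fa"
```

Unlike a pattern based on [A-Z] and [a-z], this one anchors on the closing bracket at the end of the line, so text earlier in the header cannot be mistaken for the species name.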
2.9 years ago
fracarb8 ★ 1.7k

I am sure there is a better way

string <- "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]" 
match <- sub("^.*\\[(.*)\\]$","\\1",string)
string <- sub("^",paste0(match,"_"), string)

# Update: a single regex in the style of Dunois' answer also works, without needing stringr
string <- "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]" 
string <- sub("(^.*)\\[([A-Z]{1}[a-z]+\\s[a-z]+)\\]","\\2_\\1\\[\\2\\]",string)

> string
[1] "Homo sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"

stringr's just for syntactic convenience.

And also, I forgot R can handle nested capture groups, so you can actually replace the regex from my solution with the much shorter sub("(.*([A-Z]+[a-z]+) ([a-z]+))", "\\2_\\3_\\1", string). Note I've also fixed the regex to account for the fact that OP wants an underscore within the species name.

There's probably an even more concise solution with a single capture group, but I can't really think of it now. (Not that this matters for the OP probably.)


Thanks for the clarification. I personally prefer to avoid being too concise when regular expressions are involved.

2.9 years ago
Dunois ★ 2.8k

Here you go:

library(stringr)

#Toy case.
df <- data.frame(x = "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]", stringsAsFactors = FALSE)

#Un-edited.
df$x

#[1] "gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"

#Using str_replace with nested capture groups to rearrange the text.
df$x <- str_replace(df$x, "(.*([A-Z]+[a-z]+) ([a-z]+))", "\\2_\\3_\\1")

#Result.
df$x

# [1] "Homo_sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]"

Thank you for the answer. How can I apply this method to the whole file?


I assumed you had your file imported into R already, as a data.frame or something to that effect.

So if it's just a FASTA file you need to manipulate in general, and you're not bound to R, here's a solution assuming you're working in a Unix-like environment (e.g., Ubuntu, off of which I am basing the rest of the explanation here).

I'm assuming you have all your sequences in a file named input.fasta, which looks something like this:

>gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]
ATGCATGCGTGTGTGTGG
>gi|122937398|ref|NP_001078989.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Escherichia coli]
ATGCATGCAGAGAGAGAG

To update your FASTA headers the way you've indicated in the OP, go to the command line, and execute this:

sed -r 's/^>(.*([A-Z]+[a-z]+) ([a-z]+))/>\2_\3_\1/g' input.fasta > output.fasta

input.fasta is the input to the command line utility sed, and your output will be stored in a file called output.fasta, which will look like this:

>Homo_sapiens_gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]
ATGCATGCGTGTGTGTGG
>Escherichia_coli_gi|122937398|ref|NP_001078989.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Escherichia coli]
ATGCATGCAGAGAGAGAG

I assume this is what you want?

Note: output.fasta will be stored in whichever directory you run sed from. To check where you are (from the command line), type pwd, and it will print your current location as a path. Ideally what you want to do -- if you're inexperienced with this -- is to use your GUI file browser to navigate to the directory/folder where input.fasta is located, and launch the command line terminal from there (right click -> "Open in Terminal" in Ubuntu, for example). This way, output.fasta will be located exactly where input.fasta is.
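Before relying on output.fasta, it may be worth checking that only the headers changed. The sketch below runs the same substitution on a made-up three-record input and then compares the two files. The third record, the helper file names (seq_in.txt, hdr_out.txt and so on) and the LC_ALL=C prefix are assumptions added here; LC_ALL=C is there so that [A-Z] and [a-z] behave as plain ASCII ranges regardless of locale. -E is the same as -r in GNU sed.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical input: two headers in the expected layout, one without a species name.
cat > input.fasta <<'EOF'
>gi|122937398|ref|NP_001073932.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Homo sapiens]
ATGCATGCGTGTGTGTGG
>gi|122937398|ref|NP_001078989.1| cytoplasmic dynein 2 heavy chain 1 isoform 2 [Escherichia coli]
ATGCATGCAGAGAGAGAG
>seq3 no species annotation
ATGC
EOF

# Same substitution as in the reply above.
LC_ALL=C sed -E 's/^>(.*([A-Z]+[a-z]+) ([a-z]+))/>\2_\3_\1/' input.fasta > output.fasta

# 1. The number of header lines should be the same in both files.
in_headers=$(grep -c '^>' input.fasta)
out_headers=$(grep -c '^>' output.fasta)
echo "headers: $in_headers in, $out_headers out"

# 2. Sequence lines should be identical.
grep -v '^>' input.fasta > seq_in.txt
grep -v '^>' output.fasta > seq_out.txt
cmp -s seq_in.txt seq_out.txt && echo "sequences unchanged"

# 3. Headers that appear verbatim in both files were not rewritten.
grep '^>' output.fasta > hdr_out.txt
grep -Fx -f hdr_out.txt input.fasta > untouched.txt
echo "headers left untouched:"
cat untouched.txt
```

Any header listed under "headers left untouched" did not match the pattern and still needs attention. Note that the pattern does not look at the brackets themselves, so a header without a bracketed species name but with a capitalised word followed by a lowercase word would still be rewritten.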
