Question

Add nucleotides at the beginning of FASTA sequences


4.1 years ago

amitpande74 ▴ 20

Hi, I want to add 2 nucleotides at the beginning of each sequence in a FASTA file.

> 
GCATAGGC

The desired output:

>
TAGCATAGGC

Can someone help?

fasta sequence add nucleotide • 2.3k views

ADD COMMENT • link updated 4.1 years ago by shenwei356 8.7k • written 4.1 years ago by amitpande74 ▴ 20


What have you tried? This can be done with a sed command that matches only the sequence lines (by their first character) and replaces the line-beginning anchor with TA.

ADD REPLY • link 4.1 years ago by Ram 44k


sed -i 's/^/TA/' file.fasta

ADD REPLY • link 4.1 years ago by amitpande74 ▴ 20


That does not check the first character of each line. You'll end up adding TA to the header lines too, and before the > at that, essentially corrupting the FASTA file.

Also, don't use -i until you're 100% sure the command is exactly what you want.
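
To see the difference without touching a real file, one can run both expressions on a throwaway copy and compare the output. This is only a minimal sketch: the file demo.fa and its two records are made up for illustration, and the guarded expression is one possible way of skipping headers.

```shell
# Hypothetical two-record FASTA, created in a temporary directory.
tmp=$(mktemp -d)
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\n' > "$tmp/demo.fa"

# Naive version: every line gets the prefix, including the headers.
naive=$(sed 's/^/TA/' "$tmp/demo.fa")

# Guarded version: lines starting with ">" are skipped.
guarded=$(sed '/^>/!s/^/TA/' "$tmp/demo.fa")

# Print both results to the terminal; nothing is edited in place.
printf '%s\n' "--- naive ---" "$naive" "--- guarded ---" "$guarded"
rm -r "$tmp"
```

Because neither command uses -i, the input stays unchanged; redirect the output to a new file once it looks right.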

ADD REPLY • link 4.1 years ago by Ram 44k


Yes, it does add TA to the header. What exactly should the command be, then?

ADD REPLY • link 4.1 years ago by amitpande74 ▴ 20


amitpande74, please accept all answers that solve your question.


ADD REPLY • link 4.1 years ago by Ram 44k


A: Fasta file edition

Replace "ACTG" with "TA".

ADD REPLY • link 4.1 years ago by GenoMax 147k

score 3 · Answer 1 · 2020-10-21


4.1 years ago

shenwei356 8.7k

seqkit mutate can edit FASTA sequences (point mutation, insertion, deletion). Please use v0.14.0rc1 or later, which fixes a bug in insertion.

seqkit mutate -i supports inserting bases at any position. For example, given two sequences, one of them spanning multiple lines:

$ cat seqs.fa 
>seq1
GCATAGGC
>seq2
AAACCC
GGGTTT

1). At the beginning

$ cat seqs.fa | seqkit mutate -i 0:TA
>seq1
TAGCATAGGC
>seq2
TAAAACCCGGGTTT

2). At the end.

$ cat seqs.fa | seqkit mutate -i -1:TA
>seq1
GCATAGGCTA
>seq2
AAACCCGGGTTTTA

3). After the 5th base

$ cat seqs.fa | seqkit mutate -i 5:TA
>seq1
GCATATAGGC
>seq2
AAACCTACGGGTTT

ADD COMMENT • link 4.1 years ago by shenwei356 8.7k


nice solution, great to know, this most certainly simplifies the task

ADD REPLY • link 4.1 years ago by Istvan Albert 101k

score 2 · Answer 2 · 2020-10-21

If each sequence occupies exactly one line and is written in capital letters, this works. (It works for both nucleotide and amino acid sequences; you can replace [A-Z] with [ATGC] if you want to be more specific.)

sed '/^[A-Z]/s/^/TA/' file.fasta > output.fasta

If you also have multi-line sequences, then you can first use this command to convert it to one-liner sequences:

awk '/^>/ {if (n++) printf("\n"); print; next} {printf("%s",$0)} END {printf("\n")}' input.fasta > file.fasta
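
The two steps can also be chained in a pipeline, so no intermediate file is needed. The sketch below is one way to do that; the function name linearize_and_prefix and the sample records are hypothetical, and the sed expression carries the same capital-letter assumption as above.

```shell
# Join wrapped sequence lines into one line per record, then add TA
# to every line that starts with a capital letter (the sequences).
# Reads FASTA on stdin and writes the result to stdout.
linearize_and_prefix() {
    awk '/^>/ { if (seen++) printf("\n"); print; next }
         { printf("%s", $0) }
         END { printf("\n") }' |
    sed '/^[A-Z]/s/^/TA/'
}

# Made-up input: seq2 is wrapped over two lines.
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\nGGGTTT\n' | linearize_and_prefix
```

Note that the output has one sequence line per record; the original line wrapping is not restored.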

score 2 · Answer 3 · 2020-10-21


4.1 years ago

cpad0112 21k

Close. Skip the header lines (assuming each sequence is on a single line):

$ sed '/^>/! s/^/TA/' test.fa

or, with GNU sed, address every second line:

$ sed '0~2 s/^/TA/' test.fa

with Awk:

$ awk -v OFS="\n" '/^>/ {getline seq; print $0,"TA"seq}' test.fa
$ awk '{print ((NR%2)? "":"TA") $0}' test.fa
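
The commands above rely on strict two-line records. If some sequences are wrapped over several lines, a variant that prefixes only the first sequence line after each header may be a safer starting point. This is a sketch, not a tested tool: prefix_first_line is a hypothetical helper name and the sample records are made up.

```shell
# Prefix only the first sequence line that follows each header.
# Line wrapping is left untouched, so that first line grows by the prefix length.
# $1 = prefix to insert (defaults to TA).
prefix_first_line() {
    awk -v prefix="${1:-TA}" '
        /^>/    { print; pending = 1; next }      # header: a prefix is now due
        pending { print prefix $0; pending = 0; next }
                { print }'
}

# Made-up input: seq2 is wrapped over two lines.
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\nGGGTTT\n' | prefix_first_line TA
```

The remaining lines of a wrapped record pass through unchanged, so the first line ends up longer than the rest.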

ADD COMMENT • link 4.1 years ago by cpad0112 21k

score 2 · Answer 4 · 2020-10-21

When sequences may span multiple lines and the resulting FASTA should stay well-formed (wrapped at a consistent line length), one needs to chain more commands.

My best bet makes use of both bioawk and seqkit (both are installable with bioconda):

cat foo.fa | bioawk -v prefix="TATA" -c fastx '{ printf(">%s\n%s%s\n", $name, prefix, $seq) }' | seqkit seq

prints

>foo
TATAATGGACTCTCGTCCTCAGAAAGTCTGGATGACGCCGAGTCTCACTGAATCTGACAT
GGATTACCACAAGATCTTGACAGCAGGTCTGTCCGTTCAACAGGGGGTTGTTCGGCAAAG
AGTCATCCCAGTGTATCAAGTAAACAATCTTGAGATCCCAGTGTATCAAGTAAACAATCT
TGAGATCCCAGTGTATCAAGTAAACAATCTTGAGATCCCAGTGTATCAAGTAAACAATCT
TGAGATCCCAGTGTATCAAGTAAACAATCTTGAG

Uses the trick shown in A: Fasta file edition
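
If neither bioawk nor seqkit is at hand, the same idea (join each record, add the prefix, re-wrap) can be sketched in plain awk. The helper name prefix_and_wrap, its two arguments, and the sample records are hypothetical; the default width of 60 is chosen only to match the output shown above.

```shell
# Join each record, prepend the prefix, then re-wrap at a fixed width.
# $1 = prefix (default TA), $2 = line width (default 60).
prefix_and_wrap() {
    awk -v prefix="${1:-TA}" -v width="${2:-60}" '
        # Print the stored record: header first, then the wrapped sequence.
        function emit(    i, full) {
            if (header == "") return
            print header
            full = prefix seq
            for (i = 1; i <= length(full); i += width)
                print substr(full, i, width)
        }
        /^>/ { emit(); header = $0; seq = ""; next }
             { seq = seq $0 }
        END  { emit() }'
}

# Made-up input, wrapped at width 6 to keep the demonstration short.
printf '>seq1\nGCATAGGC\n>seq2\nAAACCC\nGGGTTT\n' | prefix_and_wrap TA 6
```

Unlike the bioawk version, this keeps the full header line, including any description after the sequence name.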