How can I add an incremented value just after Accession no. ?
6.4 years ago

I have a large CSV file (1.7 GB) containing sequences, and I need to give each sequence a header, so I did something like this in bash:

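#!/bin/bash

# Work on a copy so the original CSV stays untouched
cat con_test.csv > out.out

for file in out.out; do
    # Prepend the same header line to every sequence line (GNU sed, in place)
    sed -e 's/^/>NZ_CP00000.1 volvox complete genome\n/' -i "$file"
done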

My input file:

AAAAAAAATGTGCTCCGGCCTCCGCGAAATTCGCGACGCCGGCCGCGTGGGCATGCACGTC

GGCCGTTACCTGGAGCCAGCGGGACTCGAAGGATGCCCCACGATGAGTTCAGCAGCAATGA

CCAAGCCTGCGCGTGCCCTGCGTGGTTCTTCCCCACAGCAGCACACCGTGAGGGCAAACTG

TCGCCGCACGTTCGGGCAAAAAAACCTGACGTGCGCGGTCTTGTAAAGCGGTTAGTCACCGA

AGGGCACGCGGGGCCGATTCGCACCGGCCGAGGTCTGCCCAAGGCAACCCCTAGAGTCTAG

My output file after running this script (NZ_CP00000.1):

NZ_CP00000.1 volvox complete genome AAAAAAAATGTGCTCCGGCCTCCGCGAAATTCGCGACGCCGGCCGCGTGGGCATGCACGTC

NZ_CP00000.1 volvox complete genome GGCCGTTACCTGGAGCCAGCGGGACTCGAAGGATGCCCCACGATGAGTTCAGCAGCAATGA

NZ_CP00000.1 volvox complete genome CCAAGCCTGCGCGTGCCCTGCGTGGTTCTTCCCCACAGCAGCACACCGTGAGGGCAAACTG

NZ_CP00000.1 volvox complete genome TCGCCGCACGTTCGGGCAAAAAAACCTGACGTGCGCGGTCTTGTAAAGCGGTTAGTCACCGA

NZ_CP00000.1 volvox complete genome AGGGCACGCGGGGCCGATTCGCACCGGCCGAGGTCTGCCCAAGGCAACCCCTAGAGTCTAG

Now I want to append a unique value to the accession number of every sequence, incremented each time, so that the description line looks something like this (NZ_CP00000.1_000000001):

>NZ_CP00000.1_000000001 volvox complete genome AAAAAAAATGTGCTCCGGCCTCCGCGAAATTCGCGACGCCGGCCGCGTGGGCATGCACGTC

>NZ_CP00000.1_000000002 volvox complete genome GGCCGTTACCTGGAGCCAGCGGGACTCGAAGGATGCCCCACGATGAGTTCAGCAGCAATGA

>NZ_CP00000.1_000000003 volvox complete genome CCAAGCCTGCGCGTGCCCTGCGTGGTTCTTCCCCACAGCAGCACACCGTGAGGGCAAACTG

>NZ_CP00000.1_000000004 volvox complete genome TCGCCGCACGTTCGGGCAAAAAAACCTGACGTGCGCGGTCTTGTAAAGCGGTTAGTCACCGA

>NZ_CP00000.1_000000005 volvox complete genome AGGGCACGCGGGGCCGATTCGCACCGGCCGAGGTCTGCCCAAGGCAACCCCTAGAGTCTAG

How can I achieve this?

6.4 years ago
awk '/^>/ {$1=sprintf("%s_%010d",$1,++N);} {print;}' input.fa
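The one-liner above expects a FASTA file that already has `>` header lines, and `%010d` pads the counter to 10 digits, while the format asked for in the question has 9. As a minimal sketch, assuming the raw file holds one sequence per line (possibly separated by blank lines), the header and the counter could also be written in a single awk pass, which avoids the separate sed step on a 1.7 GB file. The file names `sample_seqs.txt` and `numbered.fa` and the shortened sequences are hypothetical and only serve as illustration.

```shell
#!/bin/bash
# Hypothetical sample input: one sequence per line, with blank lines in between.
printf '%s\n\n' \
  AAAAAAAATGTGCTCC \
  GGCCGTTACCTGGAGC \
  CCAAGCCTGCGCGTGC > sample_seqs.txt

# NF is 0 on empty lines, so they are skipped and never counted.
# ++n increments once per sequence; %09d zero-pads the counter to 9 digits.
awk 'NF { printf ">NZ_CP00000.1_%09d volvox complete genome\n%s\n", ++n, $0 }' \
  sample_seqs.txt > numbered.fa

cat numbered.fa
```

Changing `%09d` to another width (for example `%010d`, as in the answer above) only changes the amount of zero padding.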

Thank you so much, it works.

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

6.4 years ago
$ seqkit replace -p '^(.+?) (.+)' -r '${1}_{nr} $2' --nr-width 9 -w 0 seq.fa
>NZ_CP00000.1_000000001 volvox complete genome
AAAAAAAATGTGCTCCGGCCTCCGCGAAATTCGCGACGCCGGCCGCGTGGGCATGCACGTC
>NZ_CP00000.1_000000002 volvox complete genome
GGCCGTTACCTGGAGCCAGCGGGACTCGAAGGATGCCCCACGATGAGTTCAGCAGCAATGA
>NZ_CP00000.1_000000003 volvox complete genome
CCAAGCCTGCGCGTGCCCTGCGTGGTTCTTCCCCACAGCAGCACACCGTGAGGGCAAACTG
>NZ_CP00000.1_000000004 volvox complete genome
TCGCCGCACGTTCGGGCAAAAAAACCTGACGTGCGCGGTCTTGTAAAGCGGTTAGTCACCGA
>NZ_CP00000.1_000000005 volvox complete genome
AGGGCACGCGGGGCCGATTCGCACCGGCCGAGGTCTGCCCAAGGCAACCCCTAGAGTCTAG

I think seqkit is not installed, which is why I get the error "command not found".

Oh, you can google it.
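For context, SeqKit is a separate program that has to be installed before the command can run (it is available through Bioconda, among other routes). A minimal sketch, assuming a small FASTA file with the hypothetical name `seq_demo.fa`: check whether `seqkit` is on the `PATH`, use the command from the answer if it is, and otherwise fall back to an awk one-liner with the same 9-digit padding.

```shell
#!/bin/bash
# Hypothetical two-record FASTA used only for illustration.
printf '%s\n' \
  '>NZ_CP00000.1 volvox complete genome' \
  'AAAAAAAATGTGCTCC' \
  '>NZ_CP00000.1 volvox complete genome' \
  'GGCCGTTACCTGGAGC' > seq_demo.fa

if command -v seqkit >/dev/null 2>&1; then
  # Same command as in the answer: {nr} is the record number, padded to 9 digits.
  seqkit replace -p '^(.+?) (.+)' -r '${1}_{nr} $2' --nr-width 9 -w 0 seq_demo.fa > renamed.fa
else
  echo "seqkit not found on PATH; falling back to awk" >&2
  # Append a zero-padded counter to the first field of every header line.
  awk '/^>/ { $1 = sprintf("%s_%09d", $1, ++n) } { print }' seq_demo.fa > renamed.fa
fi

cat renamed.fa
```

`command -v` is the portable way to test whether a program can be found, and it explains the "command not found" error when it returns nothing.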

6.4 years ago
nl -nrz   -bp">"   test.fa | sed "/>/ s/^\([0-9]\+\).*\(>\w\+\.[0-9]\)\(.*\)/\2_\1\3/g;s/^\s\+//g" 

>NZ_CP00000.1_000001 volvox complete genome
AAAAAAAATGTGCTCCGGCCTCCGCGAAATTCGCGACGCCGGCCGCGTGGGCATGCACGTC
>NZ_CP00000.1_000002 volvox complete genome
GGCCGTTACCTGGAGCCAGCGGGACTCGAAGGATGCCCCACGATGAGTTCAGCAGCAATGA
>NZ_CP00000.1_000003 volvox complete genome
CCAAGCCTGCGCGTGCCCTGCGTGGTTCTTCCCCACAGCAGCACACCGTGAGGGCAAACTG
>NZ_CP00000.1_000004 volvox complete genome
TCGCCGCACGTTCGGGCAAAAAAACCTGACGTGCGCGGTCTTGTAAAGCGGTTAGTCACCGA
>NZ_CP00000.1_000005 volvox complete genome
AGGGCACGCGGGGCCGATTCGCACCGGCCGAGGTCTGCCCAAGGCAACCCCTAGAGTCTAG
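
The output above has a 6-digit counter because 6 is the default number width of `nl`. As a sketch of how the same pipeline could produce the 9 digits asked for in the question, `nl` accepts `-w` to set the width. The sed expression below is a slightly simplified variant (an assumption, not the answerer's exact command) that takes the whole first word after `>` as the accession, so a version suffix with more than one digit is also handled. It relies on GNU sed syntax (`\+`), and `sample.fa` is a hypothetical file.

```shell
#!/bin/bash
# Hypothetical two-record FASTA used only for illustration.
printf '%s\n' \
  '>NZ_CP00000.1 volvox complete genome' \
  'AAAAAAAATGTGCTCC' \
  '>NZ_CP00000.1 volvox complete genome' \
  'GGCCGTTACCTGGAGC' > sample.fa

# nl: -b 'p>' numbers only lines matching ">", -n rz zero-pads, -w 9 sets the width.
# sed, first command: on header lines, move the number behind the accession.
# sed, second command: strip the leading blanks nl puts on unnumbered lines.
nl -n rz -w 9 -b 'p>' sample.fa |
  sed '/>/ s/^\([0-9]\+\)[[:space:]]*\(>[^[:space:]]\+\)\(.*\)/\2_\1\3/; s/^[[:space:]]\+//' \
  > nl_numbered.fa

cat nl_numbered.fa
```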