>a
ACTCTAAAT
>b
AAAAACCCT
etc.
To
>a_1
ACTCTAAAT
>b_2
AAAAACCCT
awk '/^>/{$0=$0"_"(++i)}1' in > out
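For anyone unfamiliar with the idiom, here is the same one-liner spelled out with comments, as a rough sketch. The wrapper name append_counter is made up for illustration. Because only lines starting with ">" are modified, wrapped (multi-line) sequences pass through unchanged; the counter is appended to the end of the whole header line, description included.

```shell
# Hypothetical wrapper around the one-liner above; reads files given as
# arguments, or stdin when none are given.
append_counter() {
  awk '
    /^>/ { $0 = $0 "_" (++i) }  # header line: append "_<running count>" to the whole line
    1                           # always-true pattern: print every line, modified or not
  ' "$@"
}
```

Usage would be something like: append_counter in > out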
Another way to do it, which works with single-line FASTA input:
$ awk 'BEGIN{RS=">"}{if(NR>1)print ">"$1"_"(NR-1)"\n"$2}' input.fa > output.fa
A second way, which allows multiline FASTA input:
$ awk 'BEGIN{RS=">";OFS="\n"}(NR>1){print ">"$1"_"(NR-1)"\n";$1="";print $0}' input.fa | awk '$0' > output.fa
Hi, I know this post is quite old now, but I tried the above code for multi-line FASTA and it produced a FASTA file that had only the names followed by the numbers, with none of the sequences. I would like to add the numbers to the headers while keeping the sequences intact in the output file. Is there any way to do this? I have been trying to resolve a makeblastdb error saying I have duplicate seqids, and I was hoping this approach might resolve it.
Edit: I tried the line from Alex Reynolds: awk '/^>/{$0=$0"_"(++i)}1' in > out
It successfully added a number at the end of the description line, but I want the number appended directly to the sequence ID, since adding it to the description didn't seem to help with my makeblastdb error.
E.g., I want:
>CP064824.1_1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA....
Instead of:
>CP064824.1 Klebsiella pneumoniae subsp. pneumoniae strain K219 plasmid pIncFIB, complete sequence_1
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATTATTTCTCTCTGACATCACGCCGTGCGTGTAA....
Thanks!
Given input.fa:
>CP064824.1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC
Note the two Klebsiella pneumoniae entries.
The following will append an incremented counter to the sequence ID:
$ awk 'BEGIN{FS=" ";RS=">"}{if(NR>1){ a[$1]++; h=""; for(i=2; i<NF; i++) { h=h" "$i; } print ">"$1"_"a[$1]h"\n"$NF; }}' input.fa
>CP064824.1_1 Klebsiella pneumoniae
TGAACGTCACTGCCATCCTGCATTCTGAATGGCAGCATT
>ABC12345.6_1 Mycobacterium tuberculosis
ATTTCTCTCTGACATCACGCCGTGCGTGTAA
>CP064824.1_2 Klebsiella pneumoniae
GCATTCTGAATGGCAGCATTTGAACGTCACTGCCATC
For multiline FASTA, you'd need to make modifications, but hopefully this gives you some ideas.
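One possible modification, offered as a sketch rather than a definitive answer: leave RS alone and act only on header lines, so wrapped sequence lines are printed untouched. The function name number_ids is hypothetical. It assumes the sequence ID is the first whitespace-delimited token of the header (including the leading ">"), and it keeps a separate counter per ID, as in the command above. Note that assigning to $1 makes awk rebuild the header with single spaces between fields.

```shell
# Hypothetical helper: append a per-ID occurrence counter to the sequence ID.
# Works on single-line and multi-line FASTA, since only ">" lines are changed.
number_ids() {
  awk '
    /^>/ {
      id = $1                   # first token of the header, e.g. ">CP064824.1"
      $1 = id "_" (++seen[id])  # 1 for the first occurrence, 2 for the second, ...
    }
    1                           # print every line, headers and sequence alike
  ' "$@"
}
```

With the input.fa above, number_ids input.fa > output.fa should give the same three headers as before (>CP064824.1_1, >ABC12345.6_1, >CP064824.1_2), with each sequence written out exactly as it appeared in the input.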
You can do something like the following (this assumes single-line FASTA with no spaces in the headers):
cat file.fa | paste - - | awk '{print $1"_"NR"\n"$2}' > new_file.fa
Oops! I think I was in too much of a hurry, as I am busy with our Biostars Handbook.
Could you expand a bit on your post? What's the purpose of doing this?
Sorry, I just wanted to record this. Thanks for your concern; I will explain more next time.
ZQ