Hi all,
I'm trying to create a pair of bash commands (or single command) to:
(1) Extract $ACCESSION from a FASTA header in the format
>$ACCESSION Genus species strain
The accession is always followed by a space and ends with a decimal point and a version number, e.g. NC123456.7
(2) Add $GI to the same FASTA header in the format
>gi|$GI|$ACCESSION Genus species strain
...essentially adding the GI, prefix, and pipes to the header from (1).
In between these two commands I have already figured out how to query the GI from ACCESSION using:
GI=$(curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${ACCESSION}&rettype=gi")
Can you please give me an example of the most efficient way to complete this task? Much appreciated in advance!
EDIT: I should mention that I need to keep this task within the confines of a single shell script. I am also downloading genome assemblies as a multi-seq FASTA, splitting them (already done), but need to add GIs to the headers for taxon mapping. There are hundreds of assemblies with many contigs each.
It would be much easier to use biopython or bioperl.
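That said, if it has to stay inside a single shell script, something along these lines might work. This is only a rough sketch: the function names (get_accession, add_gi, fetch_gi, tag_fasta) are made up for illustration, it assumes each split file holds exactly one sequence with the header on line 1, and it assumes the efetch call from the question returns a bare numeric GI. Plain parameter expansion handles the header, so no sed or awk is needed for steps (1) and (2).

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not from any existing tool.

# (1) Accession = text between the leading '>' and the first space.
get_accession() {
    local header
    header=${1#>}                    # drop the leading '>'
    printf '%s\n' "${header%% *}"    # keep everything before the first space
}

# (2) Build the new header: >gi|$GI|$ACCESSION Genus species strain
add_gi() {
    # $1 = original header line, $2 = GI
    printf '>gi|%s|%s\n' "$2" "${1#>}"
}

# Wrapper around the efetch query from the question (URL quoted so that
# '&' is not treated as the shell's background operator).
fetch_gi() {
    curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$1&rettype=gi"
}

# Rewrite the header of a single-sequence FASTA file in place.
tag_fasta() {
    local fasta header acc gi
    fasta=$1
    IFS= read -r header < "$fasta" || return 1     # first line = header
    acc=$(get_accession "$header")
    gi=$(fetch_gi "$acc") || return 1
    # Assumption: a valid reply is digits only; skip the file otherwise.
    case $gi in
        ''|*[!0-9]*) echo "no GI found for $acc" >&2; return 1 ;;
    esac
    # New header, then the rest of the file unchanged.
    { add_gi "$header" "$gi"; tail -n +2 "$fasta"; } > "$fasta.tmp" &&
        mv "$fasta.tmp" "$fasta"
}

# Example loop (the directory name is a placeholder):
# for f in contigs/*.fasta; do tag_fasta "$f"; done
```

With hundreds of assemblies, the one-request-per-contig loop will likely be the slow part, so it may be worth adding a short sleep between calls or caching accession/GI pairs in a file.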