Question

Batch rename RefSeq assembly files to the corresponding organism name

6.1 years ago

genomes_and_MGEs ▴ 10

Hey everyone,

I just downloaded several genomes from NCBI Assembly. Let's say I downloaded all E. coli genomes. After unzipping, I have several files named by their RefSeq accession. My goal is to batch rename these files, replacing each accession with the corresponding organism name. For example, I would like to rename the file GCF_000005845.2.genomic.fna to Escherichia coli str. K-12. Could you please help me with this? Thank you

Assembly genome • 2.9k views

ADD COMMENT • link 5.8 years ago by genomes_and_MGEs ▴ 10

This might not work for your data. Can you show us the content of the headers of your fastas?

How do you want to handle the cases where 2 sequences share the same strain name?

ADD REPLY • link 6.1 years ago by Joe 21k

Joe asked: "How do you want to handle the cases where 2 sequences share the same strain name?"

Perhaps prepend the organism name to the GCF accession that's already in the filename?

You can get the organism name for a given GCF accession in a two-column format using Entrez Direct as shown below. I removed dots and replaced all spaces in the organism name with underscores so that the final filenames will be more manageable.

esearch -db assembly -query 'GCF_000005845.2' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession Organism | sed -r 's/ /_/g; s/\.//g'
GCF_0000058452  Escherichia_coli_str_K-12_substr_MG1655_(E_coli)
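The organism name above still contains parentheses, which have to be quoted in the shell. A stricter clean-up could look like the minimal sketch below. `sanitize_name` is a hypothetical helper, not part of Entrez Direct, and unlike the sed command above it keeps dots.

```shell
# sanitize_name is a hypothetical helper, not an Entrez Direct command.
# It turns spaces into underscores, then deletes every character that is
# not a letter, digit, '_', '.', '-' or newline (so parentheses disappear).
sanitize_name() {
    printf '%s\n' "$1" | tr ' ' '_' | tr -cd 'A-Za-z0-9_.\n-'
}

sanitize_name 'Escherichia coli str. K-12 substr. MG1655 (E. coli)'
# prints: Escherichia_coli_str._K-12_substr._MG1655_E._coli
```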

ADD REPLY • link 6.1 years ago by vkkodali_ncbi ★ 3.8k

Hey guys,

Thank you for your answers. I have a folder containing several individual genome FASTA files. Each file may be a multi-FASTA or a single complete genome. Either way, each file corresponds to one strain and is named after that strain's RefSeq accession. Your command should work fine for my data. I have a list.txt containing all the RefSeq accessions. Is it possible to use this list as the query for your command? Also, once I have the two-column output, how can I write a command that batch renames each file from its RefSeq accession to the corresponding organism name? I guess I should write a Python or Perl script for that, but I'm no pro in bioinformatics :D Thank you guys again for your time. Cheers

ADD REPLY • link 6.0 years ago by genomes_and_MGEs ▴ 10

This is not particularly clear to me. Can you show us a small example of the file structure you have? You can use the tree program to get an easy output (you may need to install it from apt or similar)

ADD REPLY • link 6.0 years ago by Joe 21k

You don't need a python/perl script for this. You can do this in bash. I made a few assumptions:

1. You are going to use Assembly Accession + Organism name as your new filename
2. All your accessions are unique -- i.e., you don't have duplicate accessions with distinct versions such as Acc1.Ver1, Acc1.Ver2, etc.
3. You will manage to get the formatting as you want in the filenames.txt file using Entrez Direct and standard Unix commands

$ ls GCF*
GCF_000001234.1.genomic.fna  GCF_000005678.1.genomic.fna
$ cat filenames.txt 
GCF_000001234.1_Organism_1
GCF_000005678.1_Organism_2
$ mkdir -p renamed_files ; for f in GCF* ; do x=$(echo "$f" | cut -f1 -d '.') ; of=$(grep -F "$x" filenames.txt) ; cp "$f" "renamed_files/$of.genomic.fna" ; done
$ ls renamed_files/
GCF_000001234.1_Organism_1.genomic.fna  GCF_000005678.1_Organism_2.genomic.fna
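A slightly more defensive variant of the same idea is sketched below. It assumes a tab-separated mapping file (accession in column 1, new name in column 2) instead of the single-column filenames.txt, matches the accession exactly rather than with grep, and skips files that have no entry. The function name `rename_from_map` and the mapping layout are illustrative assumptions, not something from the thread.

```shell
# rename_from_map MAP SRC OUT   (hypothetical helper name)
#   MAP: tab-separated file, column 1 = accession, column 2 = new name
#   SRC: directory holding the downloaded GCF_* FASTA files
#   OUT: directory that receives the renamed copies
rename_from_map() (
    map=$1 src=$2 out=$3
    mkdir -p "$out"
    for f in "$src"/GCF_*.fna; do
        [ -e "$f" ] || continue     # glob matched nothing
        base=${f##*/}
        # Keep only the accession with its version, e.g. GCF_000001234.1
        acc=$(printf '%s\n' "$base" | sed -E 's/^(GC[AF]_[0-9]+\.[0-9]+).*/\1/')
        # Exact string match on column 1, so '.' is not treated as a wildcard
        name=$(awk -F '\t' -v acc="$acc" '$1 == acc { print $2; exit }' "$map")
        if [ -z "$name" ]; then
            echo "no entry for $acc, skipping $base" >&2
            continue
        fi
        # Copy rather than move, so the originals stay untouched
        cp -- "$f" "$out/${acc}_${name}.genomic.fna"
    done
)

# Example call (file and directory names are placeholders):
# rename_from_map accession_to_name.tsv . renamed_files
```

Copying instead of moving mirrors the loop above: if a name turns out wrong, the original files are still there.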

ADD REPLY • link 6.0 years ago by vkkodali_ncbi ★ 3.8k

So, here's the partial content of my folder

.
|__GCF_003047065.1_ASM304706v1_genomic.fna
|__GCF_002863405.1_ASM286340v1_genomic.fna
|__GCF_000159355.1_ASM15935v1_genomic.fna
|__GCF_000159335.1_ASM15933v1_genomic.fna
...

Each GCF file has a unique organism name, and I want to fetch it so that I can rename each file to the corresponding organism name. So maybe I should run the first esearch command you provided to get the two-column output. As written, though, it only works with a single query. Can you show me a way to produce that output for all GCF files at once?

Then, maybe I can use this column as the txt file in the loop you provided

ADD REPLY • link updated 6.0 years ago by finswimmer 16k • written 6.0 years ago by genomes_and_MGEs ▴ 10

That's great, thanks!

ADD REPLY • link 6.0 years ago by genomes_and_MGEs ▴ 10

Hey guys,

Another question: Some of the outputs don't have the strain name. I guess the reason is that the organism name doesn't have that info. For example here https://www.ncbi.nlm.nih.gov/assembly/GCF_003290365.1/. If I use

for f in GCF* ; do term=$(echo $f | cut -f1,2 -d'_') ; esearch -db assembly -q $term | esummary | xtract -pattern DocumentSummary -sep ' ' -element Organism,Strain,AssemblyAccession | sed 's/ /_/g' ; done > filenames.txt

The strain name doesn't appear on filenames.txt. Could you please let me know what I'm doing wrong?

Cheers

ADD REPLY • link updated 5.8 years ago by GenoMax 147k • written 5.8 years ago by genomes_and_MGEs ▴ 10

If you have another question, please ask another question. Answers are for answers to the main question only.

ADD REPLY • link 5.8 years ago by Joe 21k

score 2 · Answer 1 · 2018-11-13

Here are the steps:

$ ls -1 GCF*
GCF_000159335.1_ASM15933v1_genomic.fna
GCF_000159355.1_ASM15935v1_genomic.fna
GCF_002863405.1_ASM286340v1_genomic.fna
GCF_003047065.1_ASM304706v1_genomic.fna
$ for f in GCF* ; do term=$(echo "$f" | cut -f1,2 -d'_') ; esearch -db assembly -query "$term" | esummary | xtract -pattern DocumentSummary -sep ' ' -element AssemblyAccession,Organism | sed 's/ /_/g' ; done > filenames.txt
$ cat filenames.txt
GCF_000159335.1_Lactobacillus_jensenii_JV-V16_(firmicutes)
GCF_000159355.1_Lactobacillus_johnsonii_ATCC_33200_(firmicutes)
GCF_002863405.1_Lactobacillus_jensenii_(firmicutes)
GCF_003047065.1_Lactobacillus_acidophilus_(firmicutes)
$ mkdir -p renamed_files ; for f in GCF* ; do x=$(echo "$f" | cut -f1,2 -d '_') ; of=$(grep -F "$x" filenames.txt) ; cp "$f" "renamed_files/$of.genomic.fna" ; done
$ ls -1 renamed_files/
'GCF_000159335.1_Lactobacillus_jensenii_JV-V16_(firmicutes).genomic.fna'
'GCF_000159355.1_Lactobacillus_johnsonii_ATCC_33200_(firmicutes).genomic.fna'
'GCF_002863405.1_Lactobacillus_jensenii_(firmicutes).genomic.fna'
'GCF_003047065.1_Lactobacillus_acidophilus_(firmicutes).genomic.fna'
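If the accessions are already in a text file such as the list.txt mentioned earlier in the thread, the lookup can be driven from that file instead of from the filenames. The sketch below is one possible way to do it. `lookup_organism` and `build_map` are hypothetical helper names, and the output is tab-separated (accession, then organism) rather than the single joined column shown above. The `</dev/null` is there because Entrez Direct tools may read standard input and could otherwise consume the rest of the list inside the `while read` loop.

```shell
# lookup_organism ACCESSION  (hypothetical helper)
# Prints the organism name for one assembly accession using the same
# Entrez Direct pipeline as above. Requires network access.
lookup_organism() {
    esearch -db assembly -query "$1" </dev/null |
        esummary |
        xtract -pattern DocumentSummary -element Organism
}

# build_map LIST  (hypothetical helper)
# Reads one accession per line from LIST and prints
# "accession<TAB>organism" for each non-empty line.
build_map() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue   # skip blank lines
        printf '%s\t%s\n' "$acc" "$(lookup_organism "$acc")"
    done < "$1"
}

# Example call (not run here because it queries NCBI):
# build_map list.txt > accession_to_organism.tsv
```

Each accession triggers its own query, so a long list takes a while to process.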