6.1 years ago
genomes_and_MGEs
Hey everyone,
I just downloaded several genomes from NCBI Assembly. Let's say I downloaded all E. coli genomes. After unzipping all the files, I have several files with the RefSeq accession as the file name. My objective is to batch rename all those individual files, replacing the accession with the corresponding organism name. So, for example, I would like to rename the file GCF_000005845.2.genomic.fna to Escherichia coli str. K-12. Could you please help me with this? Thank you
This might not work for your data. Can you show us the content of the headers of your fastas?
How do you want to handle the cases where 2 sequences share the same strain name?
Perhaps prepend the organism name to the GCF accession that's already in the filename?
You can get the organism name for a given GCF accession in a two-column format using Entrez Direct as shown below. I removed dots and replaced all spaces in the organism name with underscores so that the final filenames will be more manageable.
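A minimal sketch of that lookup, assuming Entrez Direct (esearch, esummary, xtract) is installed and on your PATH, and that the assembly document summary exposes the AssemblyAccession and Organism fields. The function names clean_organism_column and acc_to_organism are made up for illustration. Only the organism column is cleaned, so the accession keeps its version dot.

```shell
#!/usr/bin/env bash
# Sketch: accession -> "accession<TAB>organism" using Entrez Direct.
# Assumes esearch, esummary and xtract are installed.

# Clean column 2 only: remove dots, turn spaces into underscores.
clean_organism_column() {
    awk -F'\t' -v OFS='\t' '{ gsub(/\./, "", $2); gsub(/ /, "_", $2); print }'
}

# Print one two-column line for a single assembly accession.
acc_to_organism() {
    esearch -db assembly -query "$1" \
        | esummary \
        | xtract -pattern DocumentSummary -element AssemblyAccession Organism \
        | clean_organism_column
}

# Usage (needs network access):
#   acc_to_organism GCF_000005845.2
```

Other characters in the organism name, such as parentheses, are left untouched here; strip them in the awk step if they bother you in filenames.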
Hey guys,
Thank you for your answers. So, I have a folder containing several individual genome fasta files. Each file may be a multi-fasta or a complete genome. Either way, each file corresponds to a given strain and is named according to the strain's RefSeq accession. Your command should work fine for my data. I have a list.txt containing all the RefSeq accessions. Is it possible to use this command with the list as the query? Also, once I have the two-column output, how can I write a command to batch rename each RefSeq accession to the given organism name? I guess I should write a python or perl script for that, but I'm no pro in bioinformatics :D Thank you guys again for your time. Cheers
This is not particularly clear to me. Can you show us a small example of the file structure you have? You can use the tree program to get an easy output (you may need to install it from apt or similar).
You don't need a python/perl script for this. You can do this in bash. I made a few assumptions:
1. You are going to use Assembly Accession + Organism name as your new filename
2. All your accessions are unique -- i.e., you don't have duplicate accessions with distinct versions such as Acc1.Ver1, Acc1.Ver2, etc.
3. You will manage to get the formatting as you want in the filenames.txt file using Entrez Direct and standard Unix commands
So, here's the partial content of my folder
For each GCF file, there's a unique organism name and I want to fetch it so that I can rename each GCF file for the corresponding organism name. So, maybe I should run the first esearch command you provided, to retrieve a two-column format as the output. This option only works with a single query. Can you provide me a way of having a column with all GCF files at once?
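One possible way to get that column for every accession at once is to loop over list.txt and append each result to filenames.txt. This is only a sketch: it assumes list.txt holds one accession per line, that Entrez Direct is installed, and that the AssemblyAccession and Organism fields are what you want; lookup_organism and build_table are illustrative names. Note the < /dev/null on the lookup: without it, esearch may swallow the rest of list.txt from the loop's standard input.

```shell
#!/usr/bin/env bash
# Sketch: build a two-column table (accession<TAB>organism) from a list.
# Assumes esearch, esummary and xtract (Entrez Direct) are installed.

# Look up one accession; dots and spaces in the organism name are cleaned.
lookup_organism() {
    esearch -db assembly -query "$1" \
        | esummary \
        | xtract -pattern DocumentSummary -element AssemblyAccession Organism \
        | awk -F'\t' -v OFS='\t' '{ gsub(/\./, "", $2); gsub(/ /, "_", $2); print }'
}

# build_table LIST OUT: run the lookup for every accession listed in LIST.
build_table() {
    : > "$2"                                   # start with an empty table
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue              # skip blank lines
        # stdin is redirected so the lookup cannot consume the list
        lookup_organism "$acc" < /dev/null >> "$2"
    done < "$1"
}

# Usage (needs network access):
#   build_table list.txt filenames.txt
```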
Then, maybe I can use this column as the txt file in the loop you provided
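For reference, a rename loop along those lines might look like the sketch below. It is not necessarily the loop posted earlier in the thread. It assumes filenames.txt is tab-separated (accession, cleaned organism name), that each genome file name starts with its accession, and that the new name should be organism + accession with a .fna extension; rename_genomes is a made-up name. Existing targets are skipped rather than overwritten.

```shell
#!/usr/bin/env bash
# Sketch: rename genome files using a two-column table.
# Table format (assumed): accession<TAB>organism_name, one pair per line.

# rename_genomes TABLE [DIR]: rename DIR/<accession>* to
# DIR/<organism>_<accession>.fna for every row of TABLE.
rename_genomes() {
    table="$1"
    dir="${2:-.}"
    tab="$(printf '\t')"
    while IFS="$tab" read -r acc name; do
        # ignore incomplete rows (e.g. accession without a name)
        if [ -z "$acc" ] || [ -z "$name" ]; then
            continue
        fi
        for f in "$dir/$acc"*; do
            [ -e "$f" ] || continue            # no file for this accession
            new="$dir/${name}_${acc}.fna"
            if [ -e "$new" ]; then
                echo "skipping $f: $new already exists" >&2
                continue
            fi
            mv -- "$f" "$new"
        done
    done < "$table"
}

# Usage:
#   rename_genomes filenames.txt /path/to/genomes
```

Keeping the accession in the new name means two assemblies that share an organism name still end up in distinct files.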
That's great, thanks!
Hey guys,
Another question: Some of the outputs don't have the strain name. I guess the reason is that the organism name doesn't have that info. For example here https://www.ncbi.nlm.nih.gov/assembly/GCF_003290365.1/. If I use
The strain name doesn't appear on filenames.txt. Could you please let me know what I'm doing wrong?
Cheers
If you have another question, please ask another question. Answers are for answers to the main question only.