4.3 years ago
ARich
Dear Biostar Users,
I would like to generate a phylogenetic tree from ~1000 bacterial genomes. For this purpose I would like to extract the highly conserved 16S rRNA region of these genomes.
The information I have looks like the line below: a genome name and an NC accession.
Acaryochloris marina MBIC11017, NC_009925
Is there any way to automate the extraction of the 16S conserved region for these ~1000 genomes?
Looking forward to a solution.
Thanks
I am not answering your question directly but want to mention an alternative option. You may want to download the 16S rRNA BLAST indexes made available by NCBI here (Warning: large download). Use blastdbcmd from blast+ to dump the sequences in FASTA format and then pick out the ones you need. This is a curated dataset and will likely have the best sequences available for the organisms you can find.
Thank you for the suggestion. I tried it as follows:
1. I first downloaded the complete 16S rRNA database from NCBI.
2. Then I tried to extract the 16S sequences for my genome list using the following command:
blastdbcmd -db 16S_ribosomal_RNA -entry all -outfmt "%g;;%t" | \
  grep -F "${MY_GENOME-LIST}" | \
  awk -F";;" '/16S /{print $1}' | \
  blastdbcmd -db 16S_ribosomal_RNA -entry_batch - -out seq.fasta
The problem is that my file has names for 400 genomes, but in the end I am able to extract sequences for only 200. I checked why this happens: some of the genomes have no entry in this NCBI database, so they are missed at the grep -F step. So the question is: is it normal that the 16S database from NCBI is missing 16S entries for some genomes?
Thank you in advance.
The 16S rRNA BLAST database is a representative, curated set, i.e. it does not contain an entry for every genome available in the NCBI databases.
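To see exactly which names from a genome list have no match among the database titles, a small shell sketch along these lines may help. The function name report_missing and the file names genomes.txt, titles.txt and missing.txt are made up for illustration. It assumes one genome name per line in the first file and one database title per line in the second (for example from a `blastdbcmd -entry all -outfmt "%a;;%t"` dump).

```shell
#!/usr/bin/env bash
# report_missing NAMES_FILE TITLES_FILE   (hypothetical helper)
# Prints every non-empty line of NAMES_FILE that does not occur as a
# fixed substring of any line in TITLES_FILE.
report_missing() {
    local names_file=$1 titles_file=$2 name
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue                 # skip blank lines
        # -F: literal match, -q: quiet, --: name may start with a dash
        grep -qF -- "$name" "$titles_file" || printf '%s\n' "$name"
    done < "$names_file"
}

# Possible usage (file names are placeholders):
#   blastdbcmd -db 16S_ribosomal_RNA -entry all -outfmt "%a;;%t" > titles.txt
#   report_missing genomes.txt titles.txt > missing.txt
```

A name can also be reported as missing only because the title spells it differently (for instance with the word "strain" inserted), so the output is worth a manual look before concluding that the organism is absent.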
I guess the easiest way would be to download their annotated genomes with eutils and extract the 16S regions. Maybe Silva could be useful, but I'm not sure it links to the genome RefSeq IDs.
Thank you for the reply. Wouldn't downloading the whole genome and then extracting the 16S rRNA be over-engineering? Do you know a way to extract only the 16S rRNA? I am not sure whether Silva contains information for all the genomes I am working with, but I am sure that NCBI (GenBank/RefSeq) has these genomes.
Can't we extract the 16S directly with eutils? If yes, then how?
Many thanks!
I don't know, since these are assemblies. I'm not sure if you can download just the .ffs files.
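For accessions that are annotated nuccore records (such as NC_009925), one possible route with E-utilities is to request only the feature table (efetch with rettype=ft), pick out the coordinates of rRNA features whose product mentions 16S, and then fetch just that range using the seq_start, seq_stop and strand parameters. The following is a minimal sketch, not a tested pipeline: the function names find_16s, region_url and fetch_16s are hypothetical, and it assumes curl and awk are installed and that each record carries rRNA annotation with a product qualifier.

```shell
#!/usr/bin/env bash
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Read a tab-separated feature table on stdin and print "start stop" for
# every rRNA feature whose product contains "16S".
# In this format start > stop indicates the minus strand.
find_16s() {
    awk '
        BEGIN { FS = "\t" }
        # Feature line: start, stop, feature key
        NF >= 3 && $1 != "" {
            in_rrna = ($3 == "rRNA")
            start = $1; stop = $2
            next
        }
        # Qualifier line: three empty fields, then qualifier key and value
        in_rrna && $4 == "product" && $5 ~ /16S/ {
            gsub(/[<>]/, "", start)   # drop partial-feature markers
            gsub(/[<>]/, "", stop)
            print start, stop
            in_rrna = 0
        }
    '
}

# Build the efetch URL for one region; swaps the coordinates and sets
# strand=2 when the feature lies on the minus strand.
region_url() {
    local acc=$1 a=$2 b=$3 strand=1 tmp
    if [ "$a" -gt "$b" ]; then
        strand=2; tmp=$a; a=$b; b=$tmp
    fi
    printf '%s?db=nuccore&id=%s&rettype=fasta&retmode=text&seq_start=%s&seq_stop=%s&strand=%s\n' \
        "$EUTILS" "$acc" "$a" "$b" "$strand"
}

# Network part: print a FASTA record for each annotated 16S copy.
fetch_16s() {
    local acc=$1 a b
    curl -s "${EUTILS}?db=nuccore&id=${acc}&rettype=ft&retmode=text" |
        find_16s |
        while read -r a b; do
            curl -s "$(region_url "$acc" "$a" "$b")"
            sleep 1   # stay under the limit of 3 requests per second without an API key
        done
}
```

Something like `fetch_16s NC_009925 >> 16S.fasta` inside a loop over the accession list would then collect the sequences. Many bacterial genomes carry several 16S copies, so this prints one record per annotated copy, and you would still need to decide which copy to use for the tree.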