Question

How To Blast A Fasta File That Is Non-16S

1

11.9 years ago

pfryling ▴ 30

Hello,

I have an Illumina-generated FASTA file that was created from rpoB sequence data. I would like to BLAST these sequences and find the best 3 hits for each, so I can tell which organisms I am looking at. I know you can do something similar in QIIME, using BLASTALL, but with a 16S reference database.

Does anyone know of an rpoB database or a way that I could accurately identify the sequences in my FASTA file?

Any help or advice would be greatly appreciated.

Best Regards, Paul

blast biopython • 5.7k views

ADD COMMENT • link updated 11.2 years ago by Biostar 20 • written 11.9 years ago by pfryling ▴ 30

score 4 · Answer 1 · 2013-01-12

Hi Paul, I'm assuming, since you're using rpoB and you mention QIIME, that you're interested in identifying amplicons from environmental samples. Yes, you can create your own BLAST database and identify your samples that way. If you're interested in clustering your rpoB sequences and identifying the OTUs for these sequences, you can use QIIME.

There are a few ways to go about doing this. You can first use a sequence similarity clustering method that is not based on a database to cluster your sequences at a self-determined level of similarity (e.g. 95%, 97%, 99%). CD-HIT, USEARCH, etc. can do this. Since these methods don't typically identify sequences, you'll then have to BLAST your OTU sequences.
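To make the idea of database-free clustering concrete, here is a toy sketch of greedy centroid clustering. It is not how CD-HIT or USEARCH are implemented: the similarity function is a rough stand-in built on Python's `difflib` rather than a real alignment identity, and the reads are made up. Use the real tools for actual data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough similarity proxy in [0, 1]; NOT a true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.97):
    """Greedy centroid clustering over a dict of {seq_id: sequence}.

    Sequences are visited longest first. Each one joins the first centroid
    it matches at >= threshold; otherwise it becomes a new centroid.
    Returns {centroid_id: [member_ids]}, with the centroid listed first.
    """
    clusters = {}
    # Sort by length (descending), then by ID so the result is deterministic.
    order = sorted(seqs, key=lambda sid: (-len(seqs[sid]), sid))
    for sid in order:
        for centroid in clusters:
            if similarity(seqs[centroid], seqs[sid]) >= threshold:
                clusters[centroid].append(sid)
                break
        else:
            # No existing centroid was close enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Made-up reads, for illustration only.
reads = {
    "read1": "ATGGCGTACGTTAGCCTAGGATCCGATTACGGATCCAAGT",
    "read2": "ATGGCGTACGTTAGCCTAGGATCCGATTACGGATCCAAGT",
    "read3": "TTTTTTTTTTCCCCCCCCCCGGGGGGGGGGAAAAAAAAAA",
}
otus = greedy_cluster(reads, threshold=0.97)
```

Each centroid then serves as the representative OTU sequence that you would BLAST.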

If you're familiar with QIIME you can create your own database. I've done this for numerous markers and it's not particularly difficult. You'll need two files: one with your sequence database and one with the corresponding taxonomy. This is similar to the setup in MEGAN, if you've used that. You use the sequence database to assign your sequences to OTUs, and the taxonomy file to name the OTUs. In the QIIME tutorial, refer to the section "Step 3: Assign Taxonomy"; instead of using the RDP 16S database, you can supply your own from the command line.

I'm not aware of any public rpoB sequence database that is already formatted for QIIME. I am aware of the paper "Complete rpoB gene sequencing as a suitable supplement to DNA–DNA hybridization for bacterial species and genus delineation", which has an rpoB database in the supplementary materials that you could use in lieu of creating your own from NCBI, EMBL, etc.

Best of luck.

score 3 · Answer 2 · 2013-01-12

If you have a protein-coding gene, use orthology databases such as KEGG Orthology (see the KO entry for rpoB). However, in this particular example, getting the sequence data out of KEGG isn't trivial. The other option is to use a domain definition instead of the full gene, which should give you better coverage at lower specificity; see Pfam's definition of RNA_pol_Rpb2_6.

Personally, for a protein-coding gene, I would use protein sequences as the reference and BLASTX instead of BLASTN.
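For anyone unsure what BLASTX does differently: it translates each nucleotide query in all six reading frames and searches those translations against a protein database. The sketch below only illustrates that translation step, using the standard genetic code in plain Python; it is not BLAST's implementation, and the example sequence is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return {'+1': ..., '+2': ..., '+3': ..., '-1': ..., '-2': ..., '-3': ...}."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Made-up 12-base example; '*' marks a stop codon.
frames = six_frame_translations("ATGGCCAAATAG")
```

Because protein sequences are more conserved than the underlying DNA, searching at the protein level tends to find more distant rpoB homologs than BLASTN would.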

score 2 · Answer 3 · 2013-01-15

2

11.9 years ago

pfryling ▴ 30

Hey Josh,

First, thank you everyone for the advice.

I started with 207,175 rpoB sequences, and after clustering them with UCLUST I ended up with 2,334 OTUs at 97% similarity.

After following the steps mentioned in the QIIME tutorial, I am still having trouble creating my own reference database. I was also unable to locate the rpoB database in the supplementary materials. Are you referring to Supplementary Table S3, "Isolates Investigated by rpoB sequence intraspecies similarity"?

Thanks again.

Best Regards, Paul Fryling

ADD COMMENT • link 11.9 years ago by pfryling ▴ 30

1

Hi Paul, You should add this information to your question above or comment on my answer, as this is not an answer and it will help others with similar questions if they can follow your thread.

Sounds like you have good results clustering your OTUs. Excellent, half of the analysis is finished.

As far as identifying your OTUs... You need a sequence database with a corresponding taxonomy database for QIIME. If you're not able to access any previously published rpoB sequences, you can do as Michael mentioned above and create a sequence database from NCBI, EMBL, etc. This is easy to do. In addition to your curated rpoB sequences, you'll need to parse the sequence taxonomy (the names of the corresponding organisms) from NCBI into a separate taxonomy file for QIIME. If you're unsure of the text format, look at the existing databases in QIIME and make sure your text files are in the same format. Then, at the command line, substitute your rpoB database for the QIIME database you would normally use (Greengenes, the RDP database, etc.). You will then get a taxonomy identification, which you can map on a phylogenetic tree (TopiaryExplorer), or you can use sequence divergence to look at alpha and beta diversity in your samples (UniFrac).
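A minimal sketch of writing the two files, assuming you have already collected an ID, a sequence, and a lineage for each reference. The record IDs, sequences, and the `build_reference_files` helper are hypothetical. The taxonomy layout (sequence ID, a tab, then a semicolon-separated lineage) follows what I understand QIIME 1's ID-to-taxonomy maps to look like, so compare the output against an existing QIIME database before relying on it.

```python
def build_reference_files(records):
    """Build FASTA text and an ID-to-taxonomy map from reference records.

    records: iterable of (seq_id, sequence, lineage), where lineage is a
    list of names ordered from the highest rank down to species.
    Returns (fasta_text, taxonomy_text).
    """
    fasta_lines = []
    taxonomy_lines = []
    seen_ids = set()
    for seq_id, sequence, lineage in records:
        # The two files are joined on the ID, so IDs must be unique.
        if seq_id in seen_ids:
            raise ValueError(f"duplicate sequence ID: {seq_id}")
        seen_ids.add(seq_id)
        fasta_lines.append(f">{seq_id}")
        fasta_lines.append(sequence.upper())
        taxonomy_lines.append(seq_id + "\t" + "; ".join(lineage))
    return "\n".join(fasta_lines) + "\n", "\n".join(taxonomy_lines) + "\n"


# Hypothetical records; the sequences are placeholders, not real rpoB genes.
example_records = [
    ("ref001", "atggtttactcc", ["Bacteria", "Proteobacteria", "Escherichia coli"]),
    ("ref002", "atgttggaagga", ["Bacteria", "Firmicutes", "Bacillus subtilis"]),
]
fasta_text, taxonomy_text = build_reference_files(example_records)

# To save them (file names are up to you):
# with open("rpoB_ref.fasta", "w") as handle:
#     handle.write(fasta_text)
# with open("rpoB_taxonomy.txt", "w") as handle:
#     handle.write(taxonomy_text)
```

The important design point is that every ID in the FASTA file has exactly one matching line in the taxonomy file.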

Let me know if you have any other questions.

ADD REPLY • link 11.9 years ago by Josh Herr 5.8k

score 0 · Answer 4 · 2013-01-12

0

11.9 years ago

Michael 55k

Can't you just download all the sequences from NCBI (http://www.ncbi.nlm.nih.gov/protein?term=RpoB) via "Send to: File"? Then create a BLAST database from the export.
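A rough sketch of that workflow, assuming BLAST+ is installed and the export was saved as FASTA. The helper names and file names are hypothetical. The commands are only built as argument lists here, not run; the parser expects the default 12-column tabular output (`-outfmt 6`) and keeps the best 3 subjects per query by bit score. The example rows are made up.

```python
from collections import defaultdict


def makeblastdb_cmd(fasta_path, db_name, dbtype="prot"):
    """Argument list for building a BLAST database from a FASTA export."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blast_cmd(query_fasta, db_name, out_path, program="blastx", evalue="1e-5"):
    """Argument list for a search with default tabular output."""
    return [program, "-query", query_fasta, "-db", db_name,
            "-outfmt", "6", "-evalue", evalue, "-out", out_path]


def top_hits(tabular_text, n=3):
    """Best n subjects per query, ranked by bit score (12th column).

    A subject can appear several times (multiple HSPs); only its
    highest-scoring row is kept. Returns
    {query_id: [(subject_id, percent_identity, bit_score), ...]}.
    """
    best = defaultdict(dict)
    for line in tabular_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or line.startswith("#"):
            continue  # skip blank, comment, or malformed lines
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        previous = best[query].get(subject)
        if previous is None or bitscore > previous[1]:
            best[query][subject] = (pident, bitscore)
    return {
        query: sorted(
            ((s, p, b) for s, (p, b) in subjects.items()),
            key=lambda hit: hit[2],
            reverse=True,
        )[:n]
        for query, subjects in best.items()
    }


# Made-up rows: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
rows = [
    ["otu1", "refA", "99.0", "100", "1", "0", "1", "300", "1", "100", "1e-60", "210"],
    ["otu1", "refB", "95.0", "100", "5", "0", "1", "300", "1", "100", "1e-50", "180"],
    ["otu1", "refC", "90.0", "100", "10", "0", "1", "300", "1", "100", "1e-40", "150"],
    ["otu1", "refD", "85.0", "100", "15", "0", "1", "300", "1", "100", "1e-30", "120"],
    ["otu1", "refA", "80.0", "50", "10", "0", "1", "150", "1", "50", "1e-10", "50"],
]
example_output = "\n".join("\t".join(row) for row in rows)
best_three = top_hits(example_output, n=3)

# With BLAST+ on your PATH you could run the commands like this:
# import subprocess
# subprocess.run(makeblastdb_cmd("rpoB_export.fasta", "rpoB_db"), check=True)
# subprocess.run(blast_cmd("otus.fasta", "rpoB_db", "hits.tsv"), check=True)
```

If you download nucleotide sequences instead of proteins, pass `dbtype="nucl"` and `program="blastn"`.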

ADD COMMENT • link 11.9 years ago by Michael 55k