Question

Searching For Proteins In The New Oyster Genome

1

12.2 years ago

cdsouthan ★ 1.9k

I wanted to check the new Crassostrea gigas genome for similarities to a small number of proteins whose evolution I am interested in. This proved difficult: Crassostrea comes up as an invalid organism selection for TBLASTN vs WGS, and the SRA offers only BLASTN.

So is there a way to search for ORF similarities on a small scale, or do I have to wait until the assembly gets into UCSC and/or Ensembl?

genome • 3.9k views

ADD COMMENT • link updated 12.2 years ago by sr320 • 0 • written 12.2 years ago by cdsouthan ★ 1.9k

0

You could translate the genome in all 6 reading frames and use this as a local database for BLAST?

ADD REPLY • link 12.2 years ago by Whetting ★ 1.6k
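
For readers unsure what translating "in all 6 frames" means, here is a minimal sketch in plain Python. It assumes the standard genetic code and uses hypothetical helper names (`translate`, `six_frame_translations`); tblastn and tblastx perform this translation on the fly, so nothing like this needs to be run by hand before a search.

```python
# Standard genetic code, with codons enumerated in T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown bases become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

Stop codons are kept as `*` in the output, so an open reading frame shows up as a long stretch without `*` in one of the six strings.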

3

No! Use tblastx or tblastn instead. Download the draft assembly - if available - and build your own nucleotide database from it. Then use local BLAST.

ADD REPLY • link 12.2 years ago by Michael 55k
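
A rough sketch of that local workflow, assuming NCBI BLAST+ is installed and on the PATH. The file names (`contigs.fasta`, `queries.faa`, `hits.tsv`), the database name and the e-value cutoff are placeholders chosen for illustration, not values from this thread.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta_path, db_name):
    """Command that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def tblastn_cmd(query_faa, db_name, out_path, evalue="1e-5"):
    """Command that searches protein queries against the translated database."""
    return [
        "tblastn", "-query", query_faa, "-db", db_name,
        "-evalue", evalue, "-outfmt", "6", "-out", out_path,
    ]


def run_search(fasta_path, query_faa, db_name="oyster_contigs", out_path="hits.tsv"):
    """Build the database, run tblastn, and return the tabular output path."""
    if shutil.which("makeblastdb") is None or shutil.which("tblastn") is None:
        raise RuntimeError("BLAST+ executables not found on PATH")
    subprocess.run(makeblastdb_cmd(fasta_path, db_name), check=True)
    subprocess.run(tblastn_cmd(query_faa, db_name, out_path), check=True)
    return out_path
```

Calling `run_search("contigs.fasta", "queries.faa")` would write tab-separated hits (`-outfmt 6`) to `hits.tsv`; swapping `tblastn` for `tblastx` would need nucleotide queries instead of protein ones.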

0

duh...sorry, I must have been sleeping this morning!!

ADD REPLY • link 12.2 years ago by Whetting ★ 1.6k

score 2 · Answer 1 · 2012-09-22

2

12.2 years ago

Michael 55k

The GenBank accession according to the Nature paper (Zhang et al.) is AFTI01000000. While the link in the online paper doesn't work, searching NCBI Nucleotide (http://www.ncbi.nlm.nih.gov/nuccore?term=AFTI01000000) reveals 7659 entries; download all of them, e.g. as a FASTA file, and create a local BLAST database. Then run tblastx or tblastn to search for the genes. In addition, you can use available EST sequences and make a BLAST database from those. Following Ketil, I wouldn't use de novo gene predictions: the quality of predictions without a proper training set and with no manual curation will not justify the effort (it will predict anything but the genes you are looking for).

ADD COMMENT • link 12.2 years ago by Michael 55k
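
If clicking through the Nucleotide web page is impractical, one scripted alternative is NCBI's E-utilities `efetch` endpoint. The sketch below only builds request URLs for a list of individual contig accessions that you already have; it does not resolve the AFTI01000000 master record into its contigs, and the batch size is an arbitrary assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(items, size=200):
    """Split a list into chunks so each request stays reasonably small."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def efetch_fasta_url(accessions):
    """Build an efetch URL returning the given nuccore records as FASTA text."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def download_fasta(accessions, out_path):
    """Fetch one batch of records and save it to out_path (needs network access)."""
    urlretrieve(efetch_fasta_url(accessions), out_path)
    return out_path
```

For example, `efetch_fasta_url(["AFTI01022267"])` gives a URL for a single contig from this assembly; the resulting FASTA files can then be concatenated and fed to `makeblastdb`.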

1

I guess what I was getting at before, and this is just MHO, is that a broken link to a bunch of contigs in GenBank is really not acceptable for a paper in Nature. I guess I expect more when it comes to genome papers: actual data, annotations, etc., plus the bioinformatic algorithms and pipelines needed to validate the results presented. One immense disappointment for me was the lack of a program or algorithm in the Iverson et al. paper in Science this last spring. Here's a great blog post from Titus Brown about anecdotal science, both from the Iverson paper and beyond. Kinda feeling like the oyster paper is falling into this category of not providing public resources to advance science.

ADD REPLY • link 12.2 years ago by Josh Herr 5.8k

0

Glad you liked the blog (I was afraid posting the link was going to get this "closed"). I won't reiterate comments on the general patchiness of genome finishing and assemblies languishing for years without updates, but in this case you'd have thought BGI would have really wanted to pull this one all the way through to a good set of public ORFs as a SOAPdenovo showcase.

ADD REPLY • link 12.2 years ago by cdsouthan ★ 1.9k

0

Thanks for all the answers. Josh hits the nail on the head: Nature (and the referees) should have mandated the deposition of the genome, at the very least as blastable contigs. We should not be expected to do this unfinished job. Let's hope it finds its way into Ensembl eventually.

ADD REPLY • link 12.2 years ago by cdsouthan ★ 1.9k

0

The authors did submit the contig sequences to GenBank; that's why there is an accession number. That is sufficient to make them searchable. Whether or when there will be a public genome annotation, I don't know.

ADD REPLY • link 12.2 years ago by Michael 55k

0

This is a partial apology in the sense that either I mistyped or it's been fixed, but I can now find Crassostrea as a taxonomic selection in WGS for TBLASTN. It then took a few minutes to search with P56817, hit contig AFTI01022267 at e-23, and run GENSCAN, which got the 520 aa ORF right off the bat. I hope I can consequently be forgiven for answering my own question, but these contigs had been on hold since 2011. The AFTI01000000 number was dead-linked in the paper because it links to a nested set of 7659 contigs, and the relationship of these to the SRA links is unclear. It would still have been better if they had asked the NCBI to put it through Gnomon to get the XP ORFs into the database, and/or the Nature editors should have said, "good stuff but assemble it please" (see comments at http://cdsouthan.blogspot.se/2012/09/the-pearl-of-oyster-genome-is-missing.html).

ADD REPLY • link 12.2 years ago by cdsouthan ★ 1.9k
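
The "search, pick the best contig, then run a gene predictor on it" step can be sketched for a local run as follows. It assumes tabular BLAST output (`-outfmt 6`, default 12 columns); the `Hit` type, the function names and the e-value cutoff are illustrative choices, not part of any tool.

```python
import csv
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_tabular(lines):
    """Yield Hit records from default 12-column BLAST tabular output."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield Hit(row[0], row[1], float(row[2]), int(row[8]), int(row[9]),
                  float(row[10]), float(row[11]))


def best_hits(hits, max_evalue=1e-10):
    """Keep the highest-scoring hit per query among those passing the cutoff."""
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best
```

With `with open("hits.tsv") as fh: top = best_hits(parse_tabular(fh))`, each entry names the contig to pull out, and `sstart`/`send` indicate the region worth passing to a predictor such as GENSCAN.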

0

By the way, nice blog!

ADD REPLY • link 12.2 years ago by Josh Herr 5.8k

score 1 · Answer 2 · 2012-09-21

1

12.2 years ago

Josh Herr 5.8k

I agree with Michael above.

Something that you can do relatively quickly (I think) would be to do gene prediction using something like AUGUSTUS. You'll need to supply a training set or genome, but I have heard people have had good luck using the human genome for metazoans. From that output you'll have a list of putative proteins which you can BLAST against. You might not find what you are looking for this way, but it's one strategy, and it would proceed pretty quickly depending on the size of the draft genome.

I recently predicted genes on a freshly sequenced de novo genome assembly with no real close relative and it worked well and was a swift analysis.

ADD COMMENT • link 12.2 years ago by Josh Herr 5.8k

0

Thanks, I've done exactly this before using GENSCAN, but you need the contigs to tblastn against in the first place to lock down a few partial ORF matches (I can't even find a draft assembly). I don't have the wherewithal or inclination to pipeline the whole data set just for the half dozen or so proteins I am interested in, so I guess I'll just have to wait and see if and when NCBI/UCSC/Ensembl do this.

ADD REPLY • link 12.2 years ago by cdsouthan ★ 1.9k

0

I'm absolutely sorry, I thought you had a draft genome already.

I saw the paper that was just published, but I guess I was under the assumption that when there is a paper in Nature, the authors actually release the draft genome sequence to the public. I looked and can't find anything except an old EST database that doesn't even look up and running anymore.

ADD REPLY • link 12.2 years ago by Josh Herr 5.8k

0

If you're looking for some particular proteins, I would go with tblastx as above, rather than trying to predict genes de novo. The latter will IMO require a very reliable genome as well as RNAseq or EST evidence, and even then, you will have ambiguities about gene boundaries and missing exons, etc.

ADD REPLY • link 12.2 years ago by Ketil 4.1k

score 0 · Answer 3 · 2012-09-27

0

12.2 years ago

sr320 • 0

The genome and proteome are available directly from GigaScience.

http://gigadb.org/pacific_oyster/

Zhang, G; Fang, X; Guo, X; Li, L; Luo, R; Xu, F; Yang, P; Zhang, L; Wang, X; Qi, H; Zhu, Y; Yang, L; Huang, Z (2012): Genomic data from the Pacific oyster (Crassostrea gigas). GigaScience. http://dx.doi.org/10.5524/100030

The protein file (FASTA) is oyster.v9.glean.final.rename.gff.pep.gz. You could download this and do a local blastp.

You can also blast the proteins online at http://oysterdb.cn/blast.html

ADD COMMENT • link 12.2 years ago by sr320 • 0
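
Before building a blastp database from the downloaded protein file, it can help to check what is in it. This is a small sketch that reads a plain or gzipped FASTA file and looks up a single record by ID; the function names are hypothetical, and the exact ID format used in the GigaDB file is an assumption to verify against the real headers.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, gzipped or plain."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_protein(path, protein_id):
    """Return the sequence whose header starts with protein_id, or None."""
    for header, sequence in read_fasta(path):
        if header.split()[:1] == [protein_id]:
            return sequence
    return None
```

For instance, `sum(1 for _ in read_fasta("oyster.v9.glean.final.rename.gff.pep.gz"))` counts the predicted proteins, and `find_protein` pulls out one of them for comparison with an independent prediction.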

0

Aha, thanks. It would have been nice to have had these links in the paper. Just for the record, we agree (or probably both used GENSCAN) 100% on your protein OYG_10007802. Are you in discussions about an eventual Ensembl inclusion?

ADD REPLY • link 12.2 years ago by cdsouthan ★ 1.9k