Question

Analyzing Unknown DNA Sequence And Suggesting Origin

2

14.1 years ago

Mike ▴ 20

I have been given an unknown nucleotide sequence and need to analyze it using bioinformatic methods that have not been explained very well. I need to suggest what the organism is, what it is related to, which genes it has, and how they are organized. Can anyone point me in the right direction?

I have run a BLAST search, but the hits come from a few very different organisms. I then looked for proteins using ORF Finder. I remember that we were told to select some of the proteins it finds (the large ones, I think) and run BLAST with these. Should I select the large ones, and do they need to be in the same frame? Do I need to run ORF Finder more than once? If so, what should I change?

sequence analysis prediction homework • 9.9k views

updated 14.1 years ago by Michael 55k • written 14.1 years ago by Mike ▴ 20

1

What did you BLAST against? You got a few very different organisms: how good were your alignment scores? Do you think the target organism is among those, or do you still need to search others? If you have not found it yet, is one of your alignments good enough to assume that it comes from a closely related species? BLASTing against a number of different species, looking for some reasonable hits, and then searching related species for the best hit sounds like a reasonable iterative approach to me, especially if you have to do it manually.
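
To make "how good were your alignment scores" a bit more concrete, here is a minimal Python sketch that ranks hits by identity and query coverage. It assumes you saved the BLAST result in tabular form (`-outfmt 6` with the default twelve columns); the function name `rank_hits`, the thresholds, and the three example rows are made up for illustration only.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def rank_hits(tabular_text, query_length, min_identity=90.0, min_coverage=0.5):
    """Return hits passing both thresholds, highest bit score first.

    Thresholds are arbitrary illustration values, not recommendations.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        identity = float(hit["pident"])
        # Fraction of the query covered by this single alignment.
        span = abs(int(hit["qend"]) - int(hit["qstart"])) + 1
        coverage = span / query_length
        if identity >= min_identity and coverage >= min_coverage:
            hits.append((float(hit["bitscore"]), hit["sseqid"],
                         identity, round(coverage, 2)))
    return sorted(hits, reverse=True)


# Made-up rows, only to show the input format.
example = (
    "query1\tsubjA\t99.50\t1000\t5\t0\t1\t1000\t200\t1199\t0.0\t1800\n"
    "query1\tsubjB\t85.00\t400\t60\t0\t1\t400\t10\t409\t1e-50\t300\n"
    "query1\tsubjC\t97.00\t300\t9\t0\t1\t300\t1\t300\t1e-100\t500\n"
)
ranked = rank_hits(example, query_length=1000)
```

A hit with high identity but low coverage (like the third row) often just means a conserved gene is shared, not that the organism matches.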

14.1 years ago by Chris Evelo 10k

score 2 · Answer 1 · 2011-04-12

I would recommend BLASTing against a broad, comprehensive database such as NT or WGS. You did well to use an ORF finder, and I believe you should set a minimum gene length cutoff, probably somewhere around 200 bp (don't forget to search the reverse strand; overlaps are fine). If it is a eukaryote, the coding sequence will not necessarily stay in one frame, and you will likely need a score similar to the entropy score you used to determine the frame to decide whether a region is intron or exon.
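
As a rough illustration of the ORF search with a length cutoff on both strands, here is a minimal Python sketch. It is not what ORF Finder does internally: it assumes ATG as the only start codon and the standard stop codons, and `find_orfs` is a hypothetical helper name.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_length=200):
    """Return (strand, frame, start, end) for ATG..stop ORFs >= min_length bp.

    start/end are 0-based, end-exclusive, measured on the strand that was
    scanned (so '-' coordinates refer to the reverse complement).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_length:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence: one 216 bp ORF on the forward strand, frame 2.
toy = "CC" + "ATG" + "GCT" * 70 + "TAA" + "GG"
toy_orfs = find_orfs(toy, min_length=200)
```

Each ORF that passes the cutoff can then be translated and used as a protein BLAST query on its own, regardless of which frame or strand it came from.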

If the sequence really has no recognizable match in the databases, I would recommend using a composition-based taxonomic analysis tool such as Phymm or RAIphy.
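
To give an idea of what "composition-based" means, here is a minimal Python sketch of the kind of signal such tools build on: GC content and k-mer frequencies. This is not the Phymm or RAIphy algorithm; the function names and the choice of k = 4 are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the A/C/G/T positions of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_profile(seq, k=4):
    """Relative frequency of every k-mer over ACGT.

    Windows containing other letters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def profile_distance(p, q):
    """Euclidean distance between two profiles built with the same k."""
    return sum((p[m] - q[m]) ** 2 for m in p) ** 0.5
```

Comparing the profile of your sequence with profiles of candidate reference genomes (smaller distance means more similar composition) gives a crude hint; the dedicated tools use much more careful statistical models.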

Ram · Answer 2 · 2011-04-12

I think you should discuss this topic with your course supervisors. We did similar courses in applied bioinformatics, and feedback is always important, especially to make things clearer. Maybe you can point your course to this question. It could also be that the methods described in your course are suboptimal. That said, asking on a forum for advice is generally fine.

Sequence annotation breaks down into two steps:

Gene prediction (note ORF finding is not really sufficient)
Gene annotation

Before that, you have to find out what your sequence is, because gene prediction depends on it. First, find out whether it comes from a bacterium, an archaeon, or a eukaryote. Everything else depends on this. The sequence you got is hopefully small enough (we gave the students a bacterial plasmid sequence, for example) and hopefully from a prokaryote, which makes things much easier. A rough estimate of the domain can also be made from the size alone (e.g. see here: http://en.wikipedia.org/wiki/Genome#Comparison_of_different_genome_sizes)

Run a nucleotide BLAST against NT and look only for hits with 100% sequence identity; if the organism is in the database, you will get such a hit. Don't search at the protein level here: amino acid sequences are much more conserved, so they discriminate poorly between related organisms.
If you can't find the organism with a 100% cutoff, it is probably not in the database, and you have to lower the cutoff.
The trick is precisely not to use ORF sequences to identify the organism, but intergenic sequences. Those are often less conserved, so fewer hits will reach high identity and the ones that do are more informative. To get intergenic sequences, use a tool like getORF and take the stretches that are not covered by any ORF, if any exist.
Another idea (but more complicated) is to search directly for 16S rDNA sequences and use these for a phylogenetic classification.
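
For the intergenic-sequence trick, the bookkeeping can be sketched in a few lines of Python. This assumes you already have ORF coordinates (for example parsed from your ORF tool's output) converted to 0-based, end-exclusive positions on the forward strand; `intergenic_regions` and the `min_length` default are hypothetical choices for illustration.

```python
def intergenic_regions(seq_length, orfs, min_length=50):
    """Return (start, end) intervals not covered by any ORF.

    orfs: iterable of (start, end) pairs in forward-strand coordinates,
    0-based and end-exclusive; ORFs from either strand may overlap.
    Gaps shorter than min_length are dropped.
    """
    regions = []
    cursor = 0  # end of the region covered so far
    for start, end in sorted(orfs):
        if start - cursor >= min_length:
            regions.append((cursor, start))
        cursor = max(cursor, end)
    if seq_length - cursor >= min_length:
        regions.append((cursor, seq_length))
    return regions


# Toy coordinates: two overlapping ORFs and one separate ORF on a 1000 bp sequence.
gaps = intergenic_regions(1000, [(100, 300), (250, 500), (700, 900)])
```

Each returned interval can be sliced out of the sequence (`seq[start:end]`) and used as a separate nucleotide BLAST query.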

Now that you know what kind of organism you are looking at, you have to choose the right methods:

for ORF finding, set the correct genetic code
run gene prediction programs (Glimmer, CRITICA, GISMO, GeneMark-HMM, etc.) with the correct gene models for the organism.
search for tRNAs using tRNAscan-SE
annotate the genes using a variety of tools, e.g.:
blastp against NR (on the protein level)
BLAST against Swiss-Prot, TrEMBL
Pfam
Rfam
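
To show why the genetic code setting matters for ORF finding, here is a minimal Python sketch that translates the same sequence with two NCBI translation tables. The amino acid strings are the standard table 1 and table 4 (Mycoplasma/Spiroplasma, where TGA codes for Trp); the helper names are hypothetical. Note that the bacterial table 11 assigns the same amino acids as table 1 and differs only in the permitted start codons, which this sketch does not model.

```python
BASES = "TCAG"

# Amino acids in NCBI order: codons enumerated TTT, TTC, TTA, TTG, TCT, ...
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def codon_table(table_id):
    """Map each of the 64 codons to its amino acid for the given table."""
    codons = (a + b + c for a in BASES for b in BASES for c in BASES)
    return dict(zip(codons, TABLES[table_id]))


def translate(seq, table_id=1):
    """Translate seq in frame 0; unknown codons become 'X', stops '*'."""
    table = codon_table(table_id)
    seq = seq.upper().replace("U", "T")
    return "".join(table.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


standard = translate("ATGTGAAAATAA", table_id=1)
mycoplasma = translate("ATGTGAAAATAA", table_id=4)
```

With the standard code the ORF ends at the TGA, while with table 4 it reads through to the TAA, so the wrong setting truncates or merges predicted genes.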

I didn't put links to the tools but they are generally easy to find. I would prefer to provide the students with an environment with the tools pre-installed and perform the analysis steps at least once as an on-line exercise.