Question

Classify Conserved Non-Coding Sequences In Cross-Species Comparisons

5

14.5 years ago

Haibao Tang 3.0k

I am running a set of deep comparisons with the lastz tool between a few distantly related species. I find quite a few conserved sequences that I would like to classify as coding vs. non-coding. This is relatively straightforward with the gene predictions. Among the non-coding hits, I can already see quite a lot of tRNAs, centromeric repeats and some ancient retrotransposons. In other words, the non-coding sequences can be further divided into sub-classes.

The only approach I can think of is to compare these sequences against the NCBI nr database and hope to get some textual annotations. Is there a better way?

alignment classification conservation comparative • 5.4k views

updated 13.8 years ago by Casey Bergman 18k • written 14.5 years ago by Haibao Tang 3.0k

1

I am not sure how this will affect lastz's ability to detect conserved stretches of DNA, but what about running lastz on repeat-masked genomes? That way you should still recover coding sequences, and the non-coding hits should be more interesting than common repeats. You could also cluster the sequences with UCLUST or CD-HIT before BLASTing.
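
If masking before alignment turns out to hurt sensitivity, a related option is to align first and flag repeat-derived hits afterwards. The sketch below assumes the aligned sequences are extracted from a soft-masked genome, where repeats are written in lowercase and hard-masked bases as `N`. The function names and the 0.5 cutoff are illustrative assumptions, not part of lastz or any masking tool.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase) or hard-masked (N)."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base == "N")
    return masked / len(seq)


def is_mostly_repeat(seq: str, threshold: float = 0.5) -> bool:
    """Flag a conserved sequence as repeat-derived.

    The threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    return masked_fraction(seq) >= threshold


if __name__ == "__main__":
    # Hypothetical conserved hits keyed by a made-up identifier.
    hits = {"hit_1": "ACGTACGTAC", "hit_2": "acgtacgtAC"}
    for name, seq in hits.items():
        print(name, round(masked_fraction(seq), 2), is_mostly_repeat(seq))
```

Hits flagged this way could be set aside as "repeat" before any further classification.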

14.5 years ago by Darked89 4.7k

score 5 · Answer 1 · 2010-05-18

This is quite a challenge. I don't think BLAST versus NCBI nr is the best solution. If you want to BLAST many queries you'll need a local BLAST installation, and BLAST vs. nr is very slow unless you have access to parallelization (e.g. a cluster). Also, the annotation that you get back will only be as good as the description line of the hit sequence, which is frequently not very good (or even wrong) and difficult to parse.

A couple of suggestions:

Some of the features that you describe (RNA genes, pseudogenes) are annotated in online genome browsers (e.g. UCSC, Ensembl). You can retrieve them by fetching the appropriate chromosomal "slice" and its annotations using the start/end coordinates of your alignment. There are various ways to do this; see the genome browser websites for details.
Another option would be to decide which types of feature you'd like to retrieve, then find tools (online or local) to search the sequence corresponding to your aligned region, for example tRNAscan-SE for tRNA genes. The downside is that you would need to locate, download and install lots of diverse software, then figure out how to use each tool and how to parse its output.
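
Whichever source the annotations come from (a browser track or a tool's predictions), the classification step itself is a coordinate overlap. Here is a minimal sketch, assuming the features have already been loaded into memory as 0-based, half-open intervals. The `Feature` type, `classify_region` and the `min_frac` cutoff are hypothetical names chosen for illustration, not part of any genome browser API.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    kind: str   # e.g. "tRNA", "pseudogene", "repeat"


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def classify_region(chrom: str, start: int, end: int,
                    features: List[Feature], min_frac: float = 0.5) -> str:
    """Return the feature kind covering the largest share of the region.

    Falls back to "unannotated" when no kind covers at least min_frac of it.
    Assumes features of the same kind do not overlap each other; otherwise
    their shared bases would be counted twice.
    """
    length = end - start
    if length <= 0:
        raise ValueError("region must have positive length")
    covered = {}
    for f in features:
        if f.chrom != chrom:
            continue
        bp = overlap(start, end, f.start, f.end)
        if bp:
            covered[f.kind] = covered.get(f.kind, 0) + bp
    if not covered:
        return "unannotated"
    best_kind, best_bp = max(covered.items(), key=lambda kv: kv[1])
    return best_kind if best_bp / length >= min_frac else "unannotated"


if __name__ == "__main__":
    # Made-up annotations and alignment coordinates, for illustration only.
    annotations = [
        Feature("chr1", 100, 200, "tRNA"),
        Feature("chr1", 500, 900, "repeat"),
    ]
    print(classify_region("chr1", 120, 180, annotations))
    print(classify_region("chr1", 300, 400, annotations))
```

A linear scan like this is fine for a few thousand features; for whole-genome annotation sets an interval index or a dedicated tool would be the more sensible choice.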

As suggested by darked89, you could also cluster the sequences from the aligned regions. Provided they have sufficient similarity, that will result in a kind of classification (based on sequence), even though you don't (yet) know the function. You could then use just one representative from the cluster to search other databases for annotation, which would cut down on the work.
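
To make the "one representative per cluster" idea concrete, here is a toy sketch of greedy clustering on shared k-mers. It is not the UCLUST or CD-HIT algorithm, only an illustration of the general approach: sequences are visited longest first, and each one either joins the first representative it resembles or becomes a new representative. The function names, `k = 4` and the 0.6 Jaccard cutoff are assumptions chosen for the example.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Set of all k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs: Dict[str, str], k: int = 4,
                   min_sim: float = 0.6) -> Dict[str, List[str]]:
    """Map each representative ID to the IDs of its cluster members."""
    clusters: Dict[str, List[str]] = {}
    rep_kmers: Dict[str, Set[str]] = {}
    # Longest sequences first; ties broken by ID so the result is deterministic.
    for seq_id in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        profile = kmers(seqs[seq_id], k)
        for rep, rep_profile in rep_kmers.items():
            if jaccard(profile, rep_profile) >= min_sim:
                clusters[rep].append(seq_id)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
            rep_kmers[seq_id] = profile
    return clusters


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    example = {
        "a": "ACGTACGTACGTACGTAAAA",
        "b": "ACGTACGTACGTACGTAAA",
        "c": "GGGGCCCCGGGGCCCCTTTT",
    }
    for rep, members in greedy_cluster(example).items():
        print(rep, members)
```

Only the dictionary keys (the representatives) would then need to be searched against other databases. This sketch ignores reverse complements and real alignment identity, both of which the dedicated tools handle.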

score 1 · Answer 2 · 2010-05-18

14.5 years ago

Ning-Yi Shao ▴ 390

I think your results are probably conserved not only at the level of sequence alignment but also at the level of secondary structure, if they are transcribed. So you might look at the structural features of the non-coding transcripts with a tool such as LocARNA.

score 1 · Answer 3 · 2011-02-10

Gill Bejerano had a paper on clustering conserved noncoding DNA a few years back:

Into the heart of darkness: large-scale clustering of human non-coding DNA. Bioinformatics. 2004 Aug 4;20 Suppl 1:i40-8. http://bioinformatics.oxfordjournals.org/content/20/suppl_1/i40.abstract

Unfortunately, it does not look like the source code is available to download, but it may be available on request.