Classify Conserved Non-Coding Sequences In Cross-Species Comparisons
14.5 years ago

I am doing a bunch of deep comparisons with the lastz tool between a few distantly related species. I can find quite a few conserved sequences, which I wish to classify as coding vs. non-coding. This is relatively straightforward with the gene predictions. Among the non-coding sequences, I can already see quite a lot of tRNAs, centromeric repeats and some ancient retrotransposons. In other words, the non-coding sequences can be further divided into sub-classes.

All I can think of is to compare these sequences against the NCBI nr database and hope to get some textual annotations. But is there a better way?

alignment classification conservation comparative • 5.3k views

I am not sure how this will influence lastz's ability to detect conserved stretches of DNA, but what about running lastz on already repeat-masked genomes? That way you should still get some coding sequences, but the non-coding ones should be more interesting than common repeats. You could also think about clustering the sequences with UCLUST or CD-HIT before BLASTing.
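
If masking before alignment turns out to hurt sensitivity, a rough alternative is to flag repeat-rich hits afterwards. The sketch below assumes the aligned sequences were extracted from a soft-masked genome (lowercase = repeat, the convention used in UCSC downloads). The function names and the 0.5 cutoff are made up for illustration, not taken from any tool.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase) or hard-masked (N)."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base in "Nn")
    return masked / len(seq)


def split_by_masking(seqs: dict[str, str], max_masked: float = 0.5):
    """Split {name: sequence} into (mostly unique, mostly repeat) name lists.

    max_masked is an arbitrary illustrative cutoff, not a recommended value.
    """
    unique, repeats = [], []
    for name, seq in seqs.items():
        (repeats if masked_fraction(seq) > max_masked else unique).append(name)
    return unique, repeats
```

The "mostly repeat" group can then be set aside (or annotated as repeats) and only the remainder sent on for clustering or database searches.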

14.5 years ago
Neilfws 49k

This is quite a challenge. I don't think BLAST versus NCBI nr is the best solution. If you want to BLAST many queries you'll need a local BLAST installation, and BLAST vs. nr is very slow unless you have access to parallelization (e.g. a cluster). Also, the annotation that you get back will only be as good as the description line of the hit sequence, which is frequently not very good (or even wrong) and is difficult to parse.

A couple of suggestions:

  1. Some of the features that you describe (RNA genes, pseudogenes) are annotated in online genome browsers (e.g. UCSC, Ensembl). You can retrieve them by fetching the appropriate chromosomal "slice" and its annotations using the start/end coordinates of your alignment. There are various ways to do this - see the genome browser websites for details.
  2. Another option would be to identify the types of feature that you'd like to retrieve, then find tools (online or local) to search the sequence corresponding to your aligned region. For example, tRNAscan-SE for tRNA genes. The downside is that you would need to locate, download and install lots of diverse software, then figure out how to use each tool and how to parse its output.
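
Whichever route you take (browser annotations or your own tool predictions), the last step is the same: intersect the alignment coordinates with feature coordinates. Here is a minimal sketch, assuming both have already been parsed into tuples with 0-based, half-open coordinates. The function name and the tuple layout are my own, not from any library.

```python
from collections import defaultdict


def classify_regions(regions, features):
    """Label each aligned region with the annotation types it overlaps.

    regions:  list of (chrom, start, end) tuples, 0-based half-open
    features: list of (chrom, start, end, feature_type) tuples, same convention
    Returns a dict mapping each region to a sorted list of feature types,
    or ["unannotated"] when nothing overlaps.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, ftype in features:
        by_chrom[chrom].append((start, end, ftype))

    labels = {}
    for chrom, start, end in regions:
        # Two half-open intervals overlap when each starts before the other ends.
        hits = {ftype for fstart, fend, ftype in by_chrom.get(chrom, ())
                if fstart < end and start < fend}
        labels[(chrom, start, end)] = sorted(hits) or ["unannotated"]
    return labels
```

This is a plain linear scan per region, which is fine for a few thousand alignments. For genome-scale feature sets an interval tree or a dedicated tool such as `bedtools intersect` would be the more sensible choice.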

As suggested by darked89, you could also cluster the sequences from the aligned regions. Provided they have sufficient similarity, that will result in a kind of classification (based on sequence), even though you don't (yet) know the function. You could then use just one representative from the cluster to search other databases for annotation, which would cut down on the work.
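
To make the "one representative per cluster" idea concrete, here is a toy greedy clustering based on shared k-mers. It is only a stand-in to show the principle and is not how UCLUST or CD-HIT work internally. The k-mer size, the Jaccard threshold and all function names are assumptions chosen for illustration.

```python
def kmers(seq, k=8):
    """Set of all k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs, k=8, min_similarity=0.5):
    """Greedy clustering of {name: sequence}, longest sequences first.

    Each sequence joins the first cluster whose representative it resembles;
    otherwise it founds a new cluster and becomes its representative.
    Returns {representative_name: [member_names]}.
    """
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    reps = {}      # representative name -> its k-mer set
    clusters = {}  # representative name -> member names
    for name in order:
        profile = kmers(seqs[name], k)
        for rep, rep_profile in reps.items():
            if jaccard(profile, rep_profile) >= min_similarity:
                clusters[rep].append(name)
                break
        else:
            reps[name] = profile
            clusters[name] = [name]
    return clusters
```

The keys of the returned dict are the representatives, so only those sequences need to be searched against annotation databases. Any label found can then be propagated to the other cluster members, with appropriate caution.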

14.5 years ago
Ning-Yi Shao ▴ 390

I think that your results are probably conserved not only at the sequence level but also, if the regions are transcribed, at the level of secondary structure. So you could also look at the structural features of the non-coding transcripts, using a tool such as LocARNA.

13.8 years ago

Gill Bejerano had a paper on clustering conserved noncoding DNA a few years back:

Into the heart of darkness: large-scale clustering of human non-coding DNA. Bioinformatics. 2004 Aug 4;20 Suppl 1:i40-8. http://bioinformatics.oxfordjournals.org/content/20/suppl_1/i40.abstract

Unfortunately, it does not look like the source code is available to download, but it may be available on request.
