I have been given deep sequencing data (Illumina/Solexa) for the short RNAs from the tissue of an organism whose genome has not been sequenced. From what I was told, the reads come from RNA that was size-selected (< 100 bps) by extracting from a gel.
I would like to create a list of all the microRNAs in the sample. I am most interested in miRNAs that are unique to my organism. I don't know its exact phylogenetic relationships, but AFAIK, nothing closely related (e.g. as closely related as mouse and rat) has been studied.
There have been a number of studies that ask the same question, but in organisms where the genomic sequence is already known. Programs like miRDeep can then be used to map the reads onto the genome and predict whether the reads come from microRNAs.
One option that was suggested was to run miRDeep against the genomes of other organisms that have been sequenced. Because miRNAs are often conserved, I should find some miRNAs among my reads that way. One problem with this approach is that I am unlikely to find what I'm most interested in (miRNAs unique to my species), though I may find some that are 'archaic' (present in the genomes of these other organisms but no longer expressed, or at least not detected under the conditions tested so far).
To begin, I used FreClu to generate a unique set of reads. I set a minimum read count of 5. I tried filtering out non-miRNA sequence: I BLASTed against Rfam and fRNAdb to identify non-miRNA short RNAs. I BLASTed against an EST-based transcriptome to identify any mRNA contaminants.
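As a rough illustration of the collapsing and minimum-count step described above, here is a minimal Python sketch. It only merges exact duplicates, so it is not a stand-in for FreClu, which (as I understand it) also clusters reads that differ by likely sequencing errors. The function name `collapse_reads` and the toy reads are made up for the example.

```python
from collections import Counter


def collapse_reads(reads, min_count=5):
    """Collapse identical reads and keep those seen at least min_count times.

    Exact-match collapsing only; no error-aware clustering is attempted.
    """
    counts = Counter(read.strip().upper() for read in reads if read.strip())
    kept = [(seq, n) for seq, n in counts.items() if n >= min_count]
    # Most abundant first; ties broken alphabetically so the output is stable.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy input: one sequence seen 6 times, another seen only twice.
    example = ["TGAGGTAGTAGGTTGTATAGTT"] * 6 + ["ACGTACGTACGTACGTACGT"] * 2
    for seq, n in collapse_reads(example):
        print(seq, n)
```

The unique sequences that survive this step are what would then be BLASTed against Rfam, fRNAdb and the EST-based transcriptome.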
After running miRDeep as suggested, I did indeed find many miRNAs that are conserved. I also used the seed sequence (nucleotides 2-7 or 2-8 of the mature sequence) to find putative family relationships with miRBase members.
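The seed comparison can be sketched roughly as follows. This is only an illustration: the `known` dictionary stands in for mature sequences that would be loaded from a miRBase download, the names `toy-miR-a` and `toy-miR-b` are placeholders, and the helper names are invented for the example.

```python
from collections import defaultdict


def seed(seq, start=2, end=8):
    """Return nucleotides start..end (1-based, inclusive), written as RNA."""
    rna = seq.upper().replace("T", "U")
    return rna[start - 1:end]


def build_seed_index(known, start=2, end=8):
    """Map each seed to the set of known miRNA names that carry it."""
    index = defaultdict(set)
    for name, mature_seq in known.items():
        index[seed(mature_seq, start, end)].add(name)
    return index


def assign_families(reads, known, start=2, end=8):
    """For each read, list the known miRNAs that share its seed."""
    index = build_seed_index(known, start, end)
    return {read: sorted(index.get(seed(read, start, end), ())) for read in reads}


if __name__ == "__main__":
    # Placeholder entries; real ones would come from a miRBase mature FASTA.
    known = {
        "toy-miR-a": "UGAGGUAGUAGGUUGUAUAGUU",
        "toy-miR-b": "UAGCUUAUCAGACUGAUGUUGA",
    }
    reads = ["TGAGGTAGTAGGTTGTGTGGTT", "CCCCCCCCCCCCCCCCCCCC"]
    for read, hits in assign_families(reads, known).items():
        print(read, hits or "no seed match")
```

A seed match on its own is weak evidence, since a 6-7 nt string will match by chance fairly often, so hits are better treated as candidate family assignments than as classifications.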
In the end, though, I still have lots of sequences that don't have obvious hits to other small RNAs. Presumably some of them could be the unique miRNAs that I am interested in. At the very least, they're sequences I cannot classify. Can someone suggest another computational approach I could use to try identifying which of these unclassified seqs could be miRNAs? I experimented with trying to match major and minor products among the reads, but what I ended up with was very noisy (too many possible matches to be useful and no luck at finding a match threshold that would give sensible results).
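For reference, the kind of major/minor matching I mean looks roughly like the sketch below. It assumes the two products form an ungapped duplex with 2-nt 3' overhangs on each strand and counts Watson-Crick plus G:U wobble pairs. Real duplexes usually contain mismatches and bulges, so this is a simplification, and the function name and the idea of applying a threshold to the result are my own choices rather than an established method.

```python
# Watson-Crick pairs plus G:U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def to_rna(seq):
    """Upper-case the sequence and write T as U."""
    return seq.upper().replace("T", "U")


def duplex_pairs(major, minor, overhang=2):
    """Count paired positions in an ungapped duplex with 3' overhangs.

    Returns (paired, compared): the number of positions that pair and the
    number of positions that were compared.
    """
    a = to_rna(major)
    b = to_rna(minor)
    # Drop the 3' overhang of each strand; only the rest is expected to pair.
    a_core = a[:len(a) - overhang]
    # Reverse the minor strand so that position i faces a_core[i].
    b_core = b[:len(b) - overhang][::-1]
    compared = min(len(a_core), len(b_core))
    paired = sum((x, y) in PAIRS for x, y in zip(a_core, b_core))
    return paired, compared


if __name__ == "__main__":
    major = "UGAGGUAGUAGGUUGUAUAGUU"
    minor = "CUAUACAACCUACUACCUCAUU"  # toy partner built to pair perfectly
    print(duplex_pairs(major, minor))
```

With several thousand unclassified sequences, nearly every read finds some partner above any lenient threshold, which is where the noise comes from.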
Thank you in advance for your help,
Andrew
Edit (2011-11-15): Here are some more details on the deep sequencing data. I apologise to Larry and anyone else who was misled by the original description and its lack of detail. The data was given to me a long time ago and there was initially some confusion about its makeup, which I obviously internalised (I should have reviewed the e-mails again rather than relying on memory; again, I apologise for this mistake). The reads are single-end reads from an Illumina/Solexa (not 454) sequencer and are at most 36 bp long. The extracted gel band was supposed to contain 15-30 bp sequences, though I was told that it is not unexpected for larger sequences (> 40 bp) to be extracted as well. Most of my reads are in the 19-23 bp range (some are the maximum length of 36 bp, though I was not able to classify most of them).
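The length figures above come from a simple tally along these lines. This is a minimal sketch: the function names are made up, and the 19-23 window is just the range quoted above, not a formal cutoff.

```python
from collections import Counter


def length_distribution(reads):
    """Count how many reads there are of each length, ignoring blank lines."""
    return Counter(len(read.strip()) for read in reads if read.strip())


def fraction_in_range(reads, low=19, high=23):
    """Fraction of reads whose length falls within [low, high]."""
    dist = length_distribution(reads)
    total = sum(dist.values())
    if total == 0:
        return 0.0
    in_range = sum(n for length, n in dist.items() if low <= length <= high)
    return in_range / total


if __name__ == "__main__":
    # Toy input: three 21-mers and one read at the 36 bp maximum.
    reads = ["A" * 21] * 3 + ["A" * 36]
    print(sorted(length_distribution(reads).items()))
    print(fraction_in_range(reads))
```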
I forgot, you could also use http://www.ebi.ac.uk/enright-srv/MapMi/, developed in the Enright lab. It is a tool "designed to locate miRNA precursor sequences in existing genomic sequences (e.g Ensembl and Ensembl Metazoa), using potential mature miRNA sequences as input". This would be one way to go about Larry Parnell's suggestion.
Thank you for the suggestions.
1) Yes, there is an EST-based transcriptome which I've used to filter out any degraded mRNA in the sample. It's been a while, but I don't recall there being much. But I assumed those hits meant degraded mRNA. I will check whether they are consistent with non-mature miRNAs.
Re: 3) I noticed that miRDeep has been updated (miRDeep2). In the paper they state:
"miRDeep2 in contrast performs excision by scanning the genome for stacks of reads. We define a stack as one or more reads that map to the exact same 5' and 3' positions in the genome."
I will repeat the analysis with the new version; perhaps something else will turn up.
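To make sure I understand the quoted definition of a stack, here is how I would express it in Python. This is only my reading of that one sentence, not miRDeep2's actual code. The tuple layout `(read_id, contig, strand, start, end)` is a hypothetical mapping format, and including the strand in the key is my assumption.

```python
from collections import defaultdict


def build_stacks(mappings):
    """Group mapped reads that share exactly the same 5' and 3' positions.

    Each mapping is a (read_id, contig, strand, start, end) tuple.
    Returns a dict keyed by (contig, strand, start, end).
    """
    stacks = defaultdict(list)
    for read_id, contig, strand, start, end in mappings:
        stacks[(contig, strand, start, end)].append(read_id)
    return dict(stacks)


if __name__ == "__main__":
    # Toy mappings: r1 and r2 share both ends; r3 ends one base later.
    mappings = [
        ("r1", "contig1", "+", 100, 121),
        ("r2", "contig1", "+", 100, 121),
        ("r3", "contig1", "+", 100, 122),
    ]
    for key, members in build_stacks(mappings).items():
        print(key, members)
```

Under this reading, a read that differs by a single base at either end starts a new stack, and a single read is already a stack.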
Regarding MapMi, it looks interesting, but I'm not sure it will help in detecting novel miRNAs. In the paper, they write:
"Our primary goal is not the discovery of novel miRNAs but the mapping of validated miRNAs in one species to their most likely orthologues in other species."
and
"This is particularly useful for recently sequenced genomes where miRNA information may be absent or sparse"