How to find miRNAs in deep sequencing data when genomic sequence is unavailable
13.1 years ago
Andrew W ▴ 290

I have been given deep sequencing data (Illumina/Solexa) for the short RNAs from the tissue of an organism whose genome has not been sequenced. From what I was told, the reads come from RNA that was size-selected (< 100 bps) by extracting from a gel.

I would like to create a list of all the microRNAs in the sample. I am most interested in miRNAs that are unique to my organism. I don't know the exact phylogenetic relationship, but AFAIK, nothing closely related (e.g. as related as mouse and rat) has been studied.

There have been a number of studies that ask the same question, but in organisms where the genomic sequence is already known. Programs like miRDeep can then be used to map the reads onto the genome and predict whether the reads come from microRNAs.

One option that was suggested was to run miRDeep using organisms that have genomic sequence. Because miRNAs are often conserved, I should find some miRNAs among my reads that way. One problem with this approach is that I am unlikely to find what I'm most interested in (miRNAs unique to my species), though I may find some that are 'archaic' (present in the genome of these other organisms but no longer expressed, or at least not detected under the conditions tested so far).

To begin, I used FreClu to generate a unique set of reads. I set a minimum read count of 5. I tried filtering out non-miRNA sequence: I BLASTed against Rfam and fRNAdb to identify non-miRNA short RNAs. I BLASTed against an EST-based transcriptome to identify any mRNA contaminants.
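For anyone wanting to reproduce the collapsing step without FreClu, here is a minimal sketch. It only does exact-match collapsing with a minimum count; it is not FreClu's algorithm, and the function name `collapse_reads` and the toy reads are made up for illustration.

```python
from collections import Counter


def collapse_reads(reads, min_count=5):
    """Collapse identical reads and keep sequences seen at least min_count times.

    Plain exact-match collapsing (an assumption for illustration); it does no
    error correction or clustering of near-identical reads.
    """
    counts = Counter(r.strip().upper() for r in reads if r.strip())
    kept = {seq: n for seq, n in counts.items() if n >= min_count}
    # Most abundant first; ties broken alphabetically for reproducible output.
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Toy input: one sequence seen 6 times, another seen only twice.
toy_reads = ["TGAGGTAGTAGGTTGTATAGTT"] * 6 + ["ACGTACGTACGTACGTACGT"] * 2
for seq, n in collapse_reads(toy_reads):
    print(seq, n)
```

With the default threshold of 5, only the first toy sequence survives.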

After running miRDeep as suggested, I did indeed find many miRNAs that are conserved. I also used the seed sequence (nucleotides 2-7 or 2-8) to find putative family relationships with miRBase members.
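The seed comparison can be sketched roughly as follows. The helper names (`seed`, `group_by_seed`) and the two-entry reference set are illustrative assumptions; in practice the reference would be loaded from the miRBase mature sequence file.

```python
from collections import defaultdict


def seed(seq, start=2, end=8):
    """Return the seed (1-based, inclusive positions start..end) as RNA."""
    rna = seq.upper().replace("T", "U")
    return rna[start - 1:end]


def group_by_seed(reads, known_mirnas, start=2, end=8):
    """Map each read to the names of known miRNAs that share its seed."""
    index = defaultdict(list)
    for name, mature in known_mirnas.items():
        index[seed(mature, start, end)].append(name)
    return {read: sorted(index.get(seed(read, start, end), [])) for read in reads}


# Toy reference set modelled on let-7a and miR-21 (assumption: replace with
# real miRBase mature sequences before drawing any conclusions).
known = {
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUU",
    "miR-21": "UAGCUUAUCAGACUGAUGUUGA",
}
reads = ["TGAGGTAGTTGGTTGTATGGTT", "ACGTACGTACGTACGTACGTAC"]
print(group_by_seed(reads, known))
```

A shared seed only suggests a family relationship; it says nothing about whether the read actually comes from a hairpin precursor.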

In the end, though, I still have lots of sequences that don't have obvious hits to other small RNAs. Presumably some of them could be the unique miRNAs that I am interested in. At the very least, they're sequences I cannot classify. Can someone suggest another computational approach I could use to try identifying which of these unclassified seqs could be miRNAs? I experimented with trying to match major and minor products among the reads, but what I ended up with was very noisy (too many possible matches to be useful and no luck at finding a match threshold that would give sensible results).

Thank you in advance for your help,

Andrew

Edit (2011-11-15): Here are some more details on the deep sequencing data. I apologise to Larry and anyone who was misled by the original description and lack of details. The data was given to me a long time ago and there was initially some confusion about its makeup, which I obviously internalised (I should have reviewed the e-mails again rather than rely on memory; again, I apologise for this mistake). The reads are single-end reads from an Illumina/Solexa (not 454) sequencer and are at most 36 bp long. The extracted gel band was supposed to contain 15-30 bp sequences, though I was told that it is not unexpected for larger sequences (> 40 bp) to be extracted as well. Most of my reads are in the 19-23 bp range (there are some at the maximum length [36 bp], though I was not able to classify most of them).

mirna next-gen sequencing • 7.0k views
13.1 years ago
Micans ▴ 270

Some ideas from our lab: 1) Do you have cDNAs / ESTs for your species? You could map to those, see whether they are primary transcripts with hairpin structures. 2) At 100bps you might still get hairpins in your reads. 3) Stack reads, look for clean 5' start sites in the stack. 4) See whether there are both 5p and 3p arms of the same hairpin stem in your sample - but you've done that already.
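Without a genome, idea 3 can only be approximated from the reads themselves. The sketch below is one possible heuristic, not an established method: it takes an abundant read as an anchor and asks what fraction of overlapping reads share its exact 5' start rather than starting 1-2 nt up- or downstream. The function name, `k`, and `max_shift` are assumptions chosen for illustration.

```python
def five_prime_homogeneity(read_counts, anchor, k=15, max_shift=2):
    """Fraction of reads overlapping `anchor` that share its exact 5' start.

    read_counts: dict mapping read sequence -> count.
    anchor: the abundant read used as a reference point.
    Reads are compared on a k-nt window; reads whose 5' end lies 1..max_shift
    nt upstream or downstream of the anchor's 5' end count as "shifted".
    """
    core = anchor[:k]
    same = shifted = 0
    for seq, n in read_counts.items():
        if seq.startswith(core):
            same += n
            continue
        for s in range(1, max_shift + 1):
            starts_downstream = seq.startswith(anchor[s:s + k])
            starts_upstream = seq[s:s + k] == core
            if starts_downstream or starts_upstream:
                shifted += n
                break
    total = same + shifted
    return same / total if total else 0.0


anchor = "TGAGGTAGTAGGTTGTATAGTT"
toy_counts = {
    anchor: 90,        # dominant read
    anchor[:21]: 8,    # same 5' start, shorter 3' end
    anchor[1:]: 2,     # starts 1 nt downstream
}
print(five_prime_homogeneity(toy_counts, anchor))
```

A value close to 1 is what one would hope to see for a Dicer/Drosha product, whereas degradation fragments tend to have ragged 5' ends. Low-complexity sequences can fool the window match, so treat the number as a rough filter only.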


I forgot, you could also use http://www.ebi.ac.uk/enright-srv/MapMi/, developed in the Enright lab. It is a tool "designed to locate miRNA precursor sequences in existing genomic sequences (e.g Ensembl and Ensembl Metazoa), using potential mature miRNA sequences as input". This would be one way to go about Larry Parnell's suggestion.


Thank you for the suggestions.

1) Yes, there is an EST-based transcriptome which I've used to filter out any degraded mRNA in the sample. It's been a while, but I don't recall there being much. But I assumed those hits meant degraded mRNA. I will check whether they are consistent with non-mature miRNAs.


Re: 3) I noticed that miRDeep has been updated (miRDeep2). In the paper they state:

"miRDeep2 in contrast performs excision by scanning the genome for stacks of reads. We define a stack as one or more reads that map to the exact same 5' and 3' positions in the genome."

I will repeat with the new version, perhaps something else turns up.


Regarding MapMi, it looks interesting, but I'm not sure it will help in detecting novel miRNAs. In the paper, they write:

"Our primary goal is not the discovery of novel miRNAs but the mapping of validated miRNAs in one species to their most likely orthologues in other species."

and

"This is particularly useful for recently sequenced genomes where miRNA information may be absent or sparse"

13.1 years ago

MicroRNA predictors are certainly one approach, but I'll offer another: synteny. If a close relative has been sequenced, use it to find microRNAs in conserved locations - ie within or next to the same protein-coding genes. This would work very well for mouse-human comparisons with a last common ancestor some 70 million years ago (mya). Do you have such a relative? Sure, there are primate and even human-specific microRNAs that a rodent-primate or mouse-human comparison will miss, but many will be found very easily simply looking for conserved gene order. You could run tests on those regions, say with the tools/approaches mentioned by RM and micans, to see how well such syntenic regions score (likely high), and then set a score threshold in order to test other regions of your genome.


Thanks, Larry. This is an interesting suggestion. I'm not sure I completely understand all the parts of it, though. When you say "use it to find microRNAs in conserved locations", I assume you mean I should look for miRNAs in the closely related species. If I were able to identify those, I suppose I might be able to match them to unidentified seqs from my samples. I'm not sure there will be anything unique that turns up that way, though. And it's the sort of thing I would have hoped miRBase would find. But perhaps the genome matching criteria were too strict to allow (cont...)


(cont...) detection in the searched organism's genome.

My apologies if I've not understood things as intended.


Look at the EntrezGene entries for human and mouse SREBF2/Srebf2. Both species show the presence of a Mir-33 gene within the Srebf2 gene, near the 3' end. You said your genome is new and has not been sequenced before, so the flanking data (outside the miR) could be used to assign the read to a putative location on the genome of the near relative, which can then be used for putative identification, e.g. miR-33-like because it has strong similarity to miR-33 and the flanking sequence matches Srebf2 intron sequence.


"I would have hoped miRBase would find."

That should read "I would have hoped miRDeep would find."


"and so the flanking data (outside the miR) could be used to assign the read to a putative location on the genome of the near relative"

I'm sorry, I'm still a bit confused. I don't understand whence I am getting this flanking sequence. I have short reads (most ~22 bps) from my RNA deep sequencing samples and an EST-based transcriptome that has already been published (and very few of the reads mapped to this transcriptome- I don't recall exact number, but I believe it was less than 50 read seqs that mapped). I don't have any genomic sequence.


Fine. I thought that the reads would be longer than 22 bps. One could use sequence outside of the (potential) miR to map the entire read to a syntenic region in the genome of a related species.


Ok, thanks, I understand now. Sorry, I should have been more precise in the original post. It's a good idea and perhaps I will find a way to use it, or at least someone with longer reads who sees this entry can give it a try.

13.1 years ago
Rm 8.3k

My two cents:

One thing I would suggest is to use a program like miRNAkey and search your reads against the entire miRBase, to maximise the number of miRNAs identified from your sample.

If resources permit: take (at least) the read sequences with top hits, design PCR primers accordingly (I am not an experimental biologist), and test whether they are amplified in your tissue/genome of interest.


I remember looking at miRNAkey a while back and not thinking it would be useful, but I will check again. I agree, the best thing to do here is test candidates identified by in silico analysis in the lab. I've told the PI who gave me the seqs that it will be very difficult to come up with an miRNA'ome based on this data and its main utility will be in directing wet-lab research.

13.1 years ago

I agree with some of the points already mentioned here: by mapping the reads to a related genome or to all the sequences in miRBase, you should be able to find related miRNAs. And "related" does not mean "identical", so this could be the best option. However, as you already pointed out, finding the truly species-specific miRNAs will be hard.
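To make the "related, not identical" point concrete, here is a minimal sketch of mismatch-tolerant matching against a set of mature miRNAs. It is deliberately simple: both sequences are anchored at their 5' ends and compared without gaps, so it will miss 5'-shifted isomiRs and indels. The function name, thresholds, and the two-entry reference set are assumptions for illustration; a real run would use the full miRBase mature set and probably a proper aligner.

```python
def closest_known(read, known_mirnas, max_mismatches=2, min_overlap=18):
    """Return (mismatches, name) for known miRNAs close to `read`.

    Comparison is 5'-anchored and ungapped over the shared length.
    """
    rna = read.upper().replace("T", "U")
    hits = []
    for name, mature in known_mirnas.items():
        if min(len(rna), len(mature)) < min_overlap:
            continue
        # zip stops at the shorter sequence, so 3' length differences are ignored.
        mismatches = sum(x != y for x, y in zip(rna, mature))
        if mismatches <= max_mismatches:
            hits.append((mismatches, name))
    return sorted(hits)


# Toy reference set modelled on let-7a and miR-21 (assumption).
known = {
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUU",
    "miR-21": "UAGCUUAUCAGACUGAUGUUGA",
}
print(closest_known("TGAGGTAGTAGGTTGTATGGTT", known))
```

The toy read differs from the let-7a entry at a single position and is reported as a 1-mismatch hit, while the miR-21 entry is rejected.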

The fact that you still have a lot of sequences you cannot annotate is completely normal. That is also the case for organisms for which the genome sequence is available. Transcription is noisy, I strongly believe many ncRNAs are not classified yet, and let's not forget that sequencing still produces quite a few errors.

One wild guess I would try in order to find miRNAs in the unclassified reads that remain after the previous analyses would be the following: as miRNAs are the result of the processing of a hairpin, there are often 2 mature RNAs, coming from both arms of the hairpin. You could look in your data for pairs of RNA sequences that show significant similarity to the complement of each other. In plants, I think this could work, as there are more base pairs in these hairpins. In animals, however, there are quite a few bulges and loops, so that might not work. Still, I don't think it guarantees anything.
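A rough sketch of this pairing idea, under stated assumptions: reads are compared antiparallel without gaps, G:U wobble pairs are allowed, and a few offsets are scanned to tolerate the 3' overhangs of a duplex. The names `duplex_score` and `candidate_pairs` and the threshold of 16 are made up for illustration. Bulges and internal loops are not modelled, which is exactly where animal hairpins are likely to cause trouble.

```python
from itertools import combinations

# Watson-Crick pairs plus G:U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def to_rna(seq):
    return seq.upper().replace("T", "U")


def duplex_score(a, b, max_offset=4):
    """Best ungapped number of base pairs between reads a and b (antiparallel)."""
    a = to_rna(a)
    b_rev = to_rna(b)[::-1]  # read b 3'->5' so it runs antiparallel to a
    best = 0
    for off in range(-max_offset, max_offset + 1):
        x, y = (a[off:], b_rev) if off >= 0 else (a, b_rev[-off:])
        best = max(best, sum((p, q) in PAIRS for p, q in zip(x, y)))
    return best


def candidate_pairs(reads, min_pairs=16):
    """All read pairs whose duplex score reaches min_pairs, best first."""
    hits = []
    for a, b in combinations(reads, 2):  # quadratic; fine for a filtered set
        score = duplex_score(a, b)
        if score >= min_pairs:
            hits.append((a, b, score))
    return sorted(hits, key=lambda h: -h[2])


read_a = "ACGUACGGAUCCGAUUGCAAGC"
read_b = "GCUUGCAAUCGGAUCCGUACGU"  # perfect reverse complement of read_a
print(duplex_score(read_a, read_b))
```

Because 6 of the 16 possible base combinations count as a pair once wobble is allowed, unrelated reads already score well by chance. The threshold therefore needs calibrating, for example on known 5p/3p pairs from miRBase, before applying it to unclassified reads.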

Another point I think you should consider: can't you sequence the genome? These days, that is not too hard anymore, and it would save you a lot of trouble.

Good luck!


Thanks, I think I will need it :)

I've asked about sequencing the genome, the PI said this might happen in the near future, but they would like to publish before then.

As regards looking for pairs, if I understand you correctly, this is what I mentioned in the last sentence. The BLAST results were nearly impossible to interpret. I suspect that it's going to be very difficult to find the correct pairs, especially given the inherent errors in the sequence data. I think, like you say, it might work in plants. Maybe if I tried using Smith-Waterman that would give me better (cont..)


(cont...) results, I'm not sure. Before I try it again, though, I will test the miRBase entries to see how easy they are to pair together. Maybe it's a problem even with miRNAs, even with clean data.
