I have two miRNA-seq dataset, already removing adapters appeared in raw data. What I want to do is to detect the differential expression between two samples(no replicates, I know it is not a good experimental design). The pipeline: 1)remove contaminations like mRNA, tRNA, snoRNA and so on; 2) select reads length between 18-26; 3) mapping to mature miRNA downloaded from miRBasev20 using bwa; 4) detect the differential expression between two samples using DESeq. However, I cannot do the first step, is there any mRNA, tRNA, snoRNA database downloaded online and is this step operated in linux or windows? And is this pipeline reasonable or not, do you have any more suggestions?
Map the reads against a complete reference genome and not only
against miRBase microRNAs.
Quantify microRNA expression using e.g. bedtools intersect or featureCounts
Use DESeq to find differentially expressed microRNAs
I don't think you have to filter out reads of length between 18-26. The length fraction you get after clipping should be in that range. The microRNA-Seq protocol should result in this length anyway.
Also reads mapping to rRNAs, mRNAs, etc. should be no problem. They map to these regions... no problem for your analysis.
Important: Allow multiple hits of each read when mapping! there are multiple copies of some microRNAs in the genome and if you do not allow it, the results might be misleading, or wrong.
Thanks for so detailed suggestion, and I'm sorry for reply later because of time lag.
Take one sample example,
I don't think you have to filter out reads of length between 18-26
In my case, after removingadapter(TGGAATTCTCGGGTGCCAAGGAACTC), Do FastQC check, there is a peak between 32-33 : mainly compose of GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT; TGCCTATGCTGAAACCCAGAGGCTGTTTCTGAGC; GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT.
so should I just let it alone, and do the mapping step?
Use DESeq to find differentially expressed microRNAs
this step, which .gtf file should I select for the reads count? This species total .gtf file including mRNA, tRNA, miRNA annotation, or just gtf file including miRNA only downloaded from miRBase?
First of all, you should find a peak somewhere around 24nt. I do not know the sequencing protocol, you use, but normally, when doing microRNA-Seq, molecules of length 18-30 (more or less) are filtered and sequenced. If you have a different protocol and a bigger length region (e.g. < 100) was sequenced, I would also filter for read of length between 18-30. Otherwise you might count microRNA precursors (~70nt).
You don't have to filter out reads overlapping with annotated mRNAs, or ncRNAs. They just map to their locus of origin and give you information. There are known microRNAs overlapping with snoRNAs or tRNAs. Read for example [1] or [2]. So, when deleting these reads, you will loose information. Also reads mapping to mRNAs... you might loose mirtrons (microRNAs in introns). I highly recommend to just map all reads. When filtering data, or mapping to just a subset (miRBase precursors), you will always loose information. Especially, if you want to predict new microRNAs. ;)
Your question about what .gtf file to use. What I would do:
Do some read counting fist. How many reads overlap with mRNAs? How
many reads overlap with different kinds of ncRNAs (rRNAs, tRNAs,
snoRNAs, etc.)? How many reads overlap with microRNAs? And so on....
That gives you a feeling for your data. Does it make sense, what you
see?
If you want to find differentially expressed miRBase microRNAs, then
you should count these reads only for DESeq. Take the miRBase file.
If you want to predict new microRNAs and do a differential expression
analysis with these, then predict microRNAs and take this file for
your DESeq analysis (or combine known with predicted).
you should find a peak somewhere around 24nt--- yes, there are two peaks, one is around 22bp, another is 32-33bp. Most 32-33bp peak (I listed), I blast against blastn, are mRNA.
In my case, for microRNA data. I have Hiseq2000, single-end 100bp sequencing.
Do some read counting fist.--I am a newbie for bioinformatic, I really want to do this step, but I do not know how to do it. I know how to collapse reads (cluster same reads and have their counts) using bwa_sam_converter.pl, but how to know reads are tRNA, snoRNA et. Using blast tools or some other softwares. I am sorry, can you list what do you do for this step?
I used mirdeep2 to get the known and novel microRNA, but the output result(.csv), there are only known miRNA counts, then how to combine known and novel microRNA together? which other softwares can do this step?
To count reads overlapping with mRNAs, try something like
bedtools intersect -abam sorted_bam_file.bam -b list_of_mRNA_exons.bed (get these from UCSC Table Browser) -u | wc -l
That returns all entries of your sorted bam file that overlap with exon of mRNAs. You need the -u option, because some entries may overlap with more then one exon and you don't want to count them twice. With wc -l you count the lines of the result from bedtools intersect.
The new bedtools version also supports BED12 format using the -split option, but I never tried it. If it works, you could also input a BED12 file of complete genes, instead of exons. But in your case, it doesn't matter, since you don't wanna quantify genes.
Thanks for your answers, I just found that in the latest version of bedtools. "-bed" option needs to be specified since the default output is binary format.
Hi
I am just curious about filtering spike-in before maping to bowtie2. I also have miRNA fastqc data to analyze and want to know how to filter spike-ins. I created a separate *.fa file listing all the spike-ins used. How should I proceed with Bowtie2 ?
e.g. spike-ins
If you are aligning to a genomic reference, then I don't think you need to filter the results per se so much as you need to annotate them appropriately. In fact, a QC plot showing the distribution of reads by gene type is probably useful.
If your institution has a license, I think you can do all of these steps in Partek.
I haven't personally tried them out, but I think these are open-source alternatives:
You can use RFAM (ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT) database that contains sequences for non-coding RNAs. It contains sequences for rRNA, snRNA, snoRNA, tRNA etc. It also has a few sequences for miRNA that you would like to remove before your analysis. The Rfam.fasta.gz file contains sequences for all the organisms, so you may have to grep sequences specific to your species.
You can also try miRanalyzer. It maps the reads against Rfam and other ncRNA databases, quantifies known microRNAs and predict new microRNA candidates.
Sorry David in case I there is no miRBase microRNAs for Aspergillus fumigatus, can I use another gff3 contains cDNA, proreins and ncRNA or what should I do??
Hi everyOne,
This conversation is more helpful for me, coz the same question what i have, is discussed here. I have an another doubt, can anyone help me? How can i exclude microRNA sequences which are below 18 and above 30 N. Is there any Tools ?
Can you post this as another question? Also what you need is a length filter which has nothing to do with miRNA, and has already been covered by several posts on biostar.
you can use: miRNAkey - A software pipeline for the analysis of microRNA Deep Sequencing data
You can download a lot of annotations using the UCSC table browser.