Dear colleagues,
I want to share my strange experience, to ask your opinion and help.
I'm working on microRNA sequencing of a never-studied plant. I received raw data from our service company, as FASTQ file. 30 million read, 50bp length.
I already did RNAseq analysis and I was quite familiar with several tools such as fastqc, trimmomatic, bowtie2, bowtie etc...
I removed the 3' adapter provided to me by the service company. They used Illumina technology. Quality control confirmed that the adaptor sequence is right.
After adaptor trimming, I have great peaks between 19 and 39 bp (first strange thing for me...) and also some minor peaks between 39 and 50...
I downloaded the "hairpin.fa" file from MirBase, without filtering for a specific organism, changing all U in T and removing items with uncommon chars (Y,X,K etc...).
The alignment rate at this step is really low... about 3%
So I did the alignment again, this time versus the A.thaliana genome. The alignment rate increased to 65% (reads aligned only 1 time about 15%). If I launch htseq-count in order to count alignments in genome regions coding for microrna, I found 0 values for all!
I really tried everything and I don't know how to solve this problem:
- adaptor trimming with: trimmomatic, cutadapt, fastx-clipper, novoalign
- mapping with: bowtie, bowtie2 (using local parameter...), mirdeep2, novoalign
Waiting for your help and suggestions!
Thank you in advance
If you got 65% alignment to the genome, you should probably take a look at the alignments with regard to location. It sounds suspiciously like you got DNA sequences and not miRNA, but a quick look at the alignments will give you a good sense of whether that is the problem.
You can also check alignment against non-coding RNA http://www.ensembl.org/info/genome/genebuild/ncrna.html. A collaborator got better with subsequent preparations from IIRC 50% NCRNA (a lot or rRNA, tRNA) down to 10%. In 2 preparations at the beginning I could also detect a CT rich sequence that was neither the used adaptor nor any known blast result instead of the adaptor. This lead to too long sequences - similar to your case. If you really have a lot longer than 23nt, I would check also check these for common 3' sequences.
Ido thank you for your suggestion! I used miRDeep2 to align again my reads and I've found few miRNAs expressed...
I will try to filter sequences longer to 23 nts, search for common 3' sequences (I have to trim them???) and then try alignments with ncrna fasta files.
Thank for your help!
Sorry I took long to answer, but BioStars does not allow me to answer more then 5 times in less then 6 hours. Restrictions for potential spammers :)
Update: using miRDeep2 with hairpin and mature sequences for all plant species, in order to find levels of known miRNAs, gave to me some results. Few miRNAs, expressed in lot of copies!
I have read a paper where this strange alignment seems to be due to a ligation bias of illumina adapter. Do you have any idea on how to try to solve or reduce this problem?
as Ido Tamir suggested I will also try to align all discarded reads, against the RFam fasta, in order to find possible matches with other ncRNAs... what do you think?
We have adaptors with 4 random nucleotides at the 3'end. i.e. the ligation product is ADAPTORNNNNSAMPLEINSERT . Of course then you have to adjust the pipeline appropriately. The idea is that the random 4 nt should overcome ligation bias and ligate with everything. like http://www.biooscientific.com/Portals/0/White%20Papers/Randomized-Adapters-for-Reducing-Bias-in-Small-RNA-Seq-Libraries.pdf
I forgot now if both adaptors have the random sequence or just the 5' one.
I filtered all the reads with more than 23 nucleotides. Launched versus RFam and found that about 97% are ncRNA but not miRNAs! Is it normal?
how many percent are these? maybe size selection was not good?
In what sense? I used all the discarded reads from the alignment (using mirdeep) versus precursor and mature mirnas. is the 97% of all the discarded reads (about the 97% of all the initial aligned reads :P).
I don't know if I answered to your question.
I also found a strange thing inside the discarded reads. I show you an example for the most "populated" reads inside my file:
I tried to highlight common regions. There's no common sequences on 3'. It seems that one is a sub-sequence of another... These reads, for example, align with tRNAs into RFam...
Thanks a lot!!!
If you used htseq-count, then what did you use as the annotation file? Wouldn't it just make more sense to count how many non-multimappers align to each contig? Better yet, have you tried using tools like mirDeep, which are designed exactly with microRNAs in mind?
When I aligned against the A. thaliana genome, I used the annotation file provided by MirBase (gff3 file).
Against the precursors file, I desisted to continue with htseq-count, because of a too much low alignment rate.
I also tried with mirDeep2, without success! :( the alignment rate is always too much low... I was thinking that maybe is a problem of adaptor trimming... what do you think?
This is the results of alignment without trimming the adapter:
If I provide the trimmed and collapsed fast, the percentage is around 4%