First time poster, sorry in advance for any mistakes.
I have some experience handling RNA-seq data for differential expression analysis. I usually follow this pipeline: pseudoalignment with salmon --> import counts to R with tximport --> differential expression analysis with DESeq2
However, I've been asked to do a similar analysis with miRNA-seq data and I'm having some trouble with it. The miRNAs were extracted from human peripheral blood and were sequenced with Illumina technology. I've created an index for salmon with the mature miRBase database and a k-mer size of 7, as the sequences I'm aligning are very short, and I'm obtaining mapping rates that vary from 1% to 40% across samples. I don't know why there is such a big difference in the mapping rates. I've tried rebuilding the index with different k-mer values, using the hairpin database instead, and changing some parameters in salmon, but the results haven't improved.
I'm very new to the miRNA world and I haven't been able to find much information about proper miRNA-seq analysis. I would be very thankful for any explanation of this problem, recommendations for different tools, or advice on how to use salmon for this specific case.
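In case it helps to compare samples side by side, the per-sample mapping rates can be collected from the salmon output with a short script. This is only a minimal sketch, assuming each sample was quantified into its own directory under a common folder and that salmon wrote its `aux_info/meta_info.json` file with a `percent_mapped` field; the function name and the `quants` folder name are hypothetical.

```python
import json
from pathlib import Path


def collect_mapping_rates(quant_root):
    """Return {sample_dir_name: percent_mapped} for every salmon output
    directory found directly under quant_root.

    Assumes the layout <quant_root>/<sample>/aux_info/meta_info.json.
    """
    rates = {}
    for meta in sorted(Path(quant_root).glob("*/aux_info/meta_info.json")):
        with meta.open() as fh:
            info = json.load(fh)
        # The sample name is taken from the directory two levels above the JSON file.
        sample = meta.parent.parent.name
        # .get() returns None if the field is missing rather than raising.
        rates[sample] = info.get("percent_mapped")
    return rates


# Example usage ("quants" is a hypothetical folder name):
# for sample, pct in collect_mapping_rates("quants").items():
#     print(f"{sample}\t{pct}")
```

Sorting the resulting table by mapping rate and comparing it against sample metadata (extraction batch, input amount, and so on) may show whether the low-rate samples have something in common.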
Can you describe the whole workflow you followed, starting from trimming and adapter removal? We had the same problem, and it turned out that one of the parameters in UMI-tools was leaving us with very few reads.
The samples were sequenced by BGI Genomics, and the FASTQ files they send you are supposedly already cleaned of any UMI or adapter sequences, so my workflow started at the alignment step. What program can I use to check whether they were correctly removed? Thanks!
Okay, let's assume that the trimming was perfect. Can you try this workflow?
Try it with 1 or 2 FASTQ files, and if the problem still persists, please ask them for the command they used for adapter removal.
Thank you very much for your response! I've tried the pipeline you suggested. Aligning and counting against an index created from miRBase gives me results very similar to my original pipeline. Aligning to the human genome dramatically improves my mapping rate (from 5 to 40%), but after counting with miRBase the results are again very similar to the original ones. Maybe these samples are of very low quality, or have been contaminated. I don't think it is a matter of adapter removal, as the peak in sequence length is at 22-24 bp.
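For anyone who wants to repeat the length check, it can be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the FASTQ file uses standard 4-line records, the function names are my own, and `adapter_prefix` is a hypothetical parameter that should be set to the first bases of whatever 3' adapter the library kit used (I am not assuming a specific sequence here).

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def summarize(path, adapter_prefix=None):
    """Return (total_reads, Counter of read lengths, reads containing adapter_prefix).

    adapter_prefix is optional; when it is None the adapter count stays 0.
    """
    lengths = Counter()
    with_adapter = 0
    total = 0
    for seq in read_sequences(path):
        total += 1
        lengths[len(seq)] += 1
        if adapter_prefix and adapter_prefix in seq:
            with_adapter += 1
    return total, lengths, with_adapter


# Example usage (file name and adapter prefix are placeholders):
# total, lengths, with_adapter = summarize("sample.fq.gz", adapter_prefix="NNNNNNNN")
# for length in sorted(lengths):
#     print(length, lengths[length])
# print("reads containing adapter prefix:", with_adapter, "of", total)
```

A length peak around 22 nt together with very few reads containing the adapter prefix would be consistent with trimming having worked, although it does not say anything about what the trimmed reads actually are.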
Did you check the contamination with FASTQ-Screen?
This is the result I obtained for a representative sample in FastQ Screen:
https://ibb.co/KmkkFGf
There seems to be a bit of contamination, but overall 95% of reads still have no hit.