I’m new to miRNA-seq data processing, and despite doing a lot of research and consulting with colleagues and professors, I still have many doubts about whether my workflow is correct. This uncertainty in the preprocessing step is making it difficult for me to move forward with confidence.
Currently, I’m working on a miRNA-seq experiment sequenced on an Ion Torrent platform with single-end (SE) reads. The experiment should include 6 samples, but when I downloaded the data from SRA, I found 12 SRR runs instead of 6, so each SAMN (BioSample) has two SRR runs. I understand that, according to quality control (QC) guidelines, two runs from the same library can be merged before analysis, but I don’t fully understand the reason behind this.
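As far as I understand, merging would simply mean concatenating the FASTQ files of the runs that belong to the same BioSample. Here is a minimal sketch of what I have in mind, assuming uncompressed FASTQ files named after each run. The accessions and file names are made-up placeholders, not my real ones, and the helper names are my own:

```python
from collections import defaultdict
from pathlib import Path
import shutil


def group_runs_by_sample(run_to_sample):
    """Invert a {run: sample} mapping into {sample: [sorted runs]}."""
    grouped = defaultdict(list)
    for run, sample in run_to_sample.items():
        grouped[sample].append(run)
    return {sample: sorted(runs) for sample, runs in grouped.items()}


def merge_fastq(fastq_paths, out_path):
    """Concatenate FASTQ files, in the given order, into out_path.

    Assumes every input file ends with a newline; otherwise the last
    record of one file would run into the first record of the next.
    """
    with open(out_path, "wb") as out:
        for path in fastq_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(out_path)


if __name__ == "__main__":
    # Hypothetical accessions, for illustration only.
    run_to_sample = {
        "SRR0000001": "SAMN0000001",
        "SRR0000002": "SAMN0000001",
    }
    for sample, runs in group_runs_by_sample(run_to_sample).items():
        paths = [f"{run}.fastq" for run in runs]
        # Only merge when all placeholder files actually exist.
        if all(Path(p).exists() for p in paths):
            merge_fastq(paths, f"{sample}.merged.fastq")
```

I assume it would still make sense to run QC on each SRR separately before merging, so that a bad run does not get hidden in the merged file.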
Additionally, I noticed that in the pre-trimming QC report, the sequence lengths range from 1 to 152 bp, which seems too broad for miRNA reads (mature miRNAs are only about 22 nt long). Currently, I am using Trimmomatic with the following parameters:
SLIDINGWINDOW:4:20 LEADING:20 TRAILING:20 CROP:35 MINLEN:18
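To look at the length distribution more closely than the FastQC plot allows, I was thinking of counting read lengths directly. This is a minimal sketch, assuming standard 4-line FASTQ records. The function names are my own, and the 18 to 35 window simply mirrors the MINLEN and CROP values above:

```python
from collections import Counter
import gzip


def read_length_counts(fastq_path):
    """Count reads per length in a FASTQ file (plain text or .gz).

    Assumes exactly 4 lines per record, with no wrapped sequences.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                counts[len(line.strip())] += 1
    return counts


def fraction_in_range(counts, low, high):
    """Fraction of reads whose length is within [low, high], inclusive."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(n for length, n in counts.items() if low <= length <= high)
    return kept / total
```

Calling `fraction_in_range(read_length_counts("sample.fastq"), 18, 35)` on a raw file (the file name is a placeholder) would show how many reads are already in that window before trimming. As far as I know, CROP:35 shortens longer reads to 35 bases rather than discarding them, so this is not the same as the fraction that survives Trimmomatic.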
In the "Overrepresented sequences" section of the FastQC report, I found hits for the ABI Solid3 Adapter B, but with different match lengths:
- ABI Solid3 Adapter B (100% over 14bp)
- ABI Solid3 Adapter B (95% over 21bp)
- ABI Solid3 Adapter B (100% over 23bp)
Even though all these sequences match the same adapter, their base composition and length vary. This makes me uncertain about how to properly identify and remove adapters during preprocessing. My specific questions are:
- Is it correct to use CROP in this case, or should I first remove adapters with Cutadapt and then trim low-quality sequences with Trimmomatic?
- Is it better to use SLIDINGWINDOW instead of AVGQUAL for quality filtering?
- How can I correctly determine which adapters to remove if each sample has a different overrepresented sequence?
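For the last question, one possibility I am considering is that each overrepresented read is a different small RNA followed by a different amount of the same adapter, which would explain the different match lengths. This is a minimal sketch of how I would check that, using exact matching only. The adapter and insert sequences are toy placeholders, NOT the real ABI Solid3 Adapter B sequence, and `adapter_start` is a helper I made up:

```python
# Toy adapter for illustration only; replace with the real adapter sequence.
ADAPTER = "ACGTTGCAAGGCTTAACCGGTT"


def adapter_start(read, adapter, min_overlap=10):
    """Return the index in `read` where `adapter` begins, or None.

    Accepts either the full adapter inside the read or a partial adapter
    running off the 3' end of the read. Exact matches only, so real data
    with sequencing errors would need a tolerant aligner such as Cutadapt.
    """
    for start in range(len(read)):
        overlap = min(len(read) - start, len(adapter))
        if overlap < min_overlap:
            break  # the remaining overlaps can only get shorter
        if read[start:start + overlap] == adapter[:overlap]:
            return start
    return None


if __name__ == "__main__":
    # Two toy inserts followed by 14 and 21 bases of the same toy adapter.
    reads = [
        "TGAGGTAGTAGGTTGTATAGTT" + ADAPTER[:14],
        "TAGCTTATCAGACTGATGTTGA" + ADAPTER[:21],
    ]
    for read in reads:
        pos = adapter_start(read, ADAPTER)
        if pos is not None:
            print(read[:pos], "adapter bases:", len(read) - pos)
```

If all the overrepresented sequences from every sample end in a prefix of one adapter, I would take that as a sign that a single adapter sequence is enough for all samples.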
I think I'm doing a lot of things wrong :( but I try not to get overwhelmed haha. I would really appreciate any guidance on these points, as I want to ensure proper preprocessing before moving on to further analysis.