I’m new to miRNA-seq data processing, and despite doing a lot of research and consulting with colleagues and professors, I still have many doubts about whether my workflow is correct. This uncertainty in the preprocessing step is making it difficult for me to move forward with confidence.
Currently, I’m working on a miRNA-seq experiment sequenced on Ion Torrent in single-end (SE) mode. In theory, the experiment includes 6 samples, but when I downloaded the data from SRA, I found 12 SRR runs instead of 6, meaning that each SAMN (BioSample) has two SRR runs. I understand that, according to quality control (QC) guidelines, two runs from the same library can be merged before analysis, but I don’t fully understand the reason behind this.
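As far as I understand it, merging would simply mean concatenating the FASTQ files of the two runs that belong to the same sample. A minimal sketch of that idea, assuming one uncompressed `<run>.fastq` file per run; the accession names and directory layout are hypothetical placeholders, not the real ones from this dataset:

```python
import shutil
from collections import defaultdict
from pathlib import Path


def merge_runs(run_to_sample, fastq_dir, out_dir):
    """Concatenate per-run FASTQ files into one FASTQ file per sample.

    run_to_sample maps a run accession to its sample accession.
    Assumes each run is stored as <fastq_dir>/<run>.fastq (an assumption).
    """
    fastq_dir, out_dir = Path(fastq_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the run accessions by the sample they belong to.
    runs_by_sample = defaultdict(list)
    for run, sample in run_to_sample.items():
        runs_by_sample[sample].append(run)

    merged = {}
    for sample, runs in runs_by_sample.items():
        out_path = out_dir / f"{sample}.fastq"
        with open(out_path, "wb") as out:
            # Sorted order only makes the output reproducible.
            for run in sorted(runs):
                with open(fastq_dir / f"{run}.fastq", "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[sample] = out_path
    return merged


# Hypothetical usage (placeholder accessions):
# merge_runs({"SRR_A1": "SAMN_A", "SRR_A2": "SAMN_A"}, "fastq", "merged")
```

It would probably still make sense to run QC on each run separately first, so that a bad run is noticed before it is mixed with a good one.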
Additionally, I noticed that in the pre-trimming QC report, the sequence lengths range from 1 to 152 bp, which seems too broad. Currently, I am using Trimmomatic with the following parameters:
SLIDINGWINDOW:4:20 LEADING:20 TRAILING:20 CROP:35 MINLEN:18
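To see how the reads are actually distributed across that 1 to 152 bp range, and roughly how many raw reads fall below MINLEN:18 or above CROP:35, a simple length tally could help. This is only a sketch: it looks at raw lengths before any quality trimming, it assumes a plain four-line-per-record FASTQ, and the toy records are made up for illustration.

```python
from collections import Counter


def read_lengths(fastq_lines):
    """Count read lengths from an iterable of FASTQ lines (4 lines per record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the second line of each record
            counts[len(line.strip())] += 1
    return counts


def summarize(counts, min_len=18, crop=35):
    """Summarize raw lengths against the MINLEN and CROP values used above."""
    return {
        "total": sum(counts.values()),
        "shorter_than_min": sum(n for length, n in counts.items() if length < min_len),
        "longer_than_crop": sum(n for length, n in counts.items() if length > crop),
    }


# Toy records (made up): reads of 5, 22 and 40 bases.
toy = [
    "@r1", "ACGTA", "+", "I" * 5,
    "@r2", "TGAGGTAGTAGGTTGTATAGTT", "+", "I" * 22,
    "@r3", "ACGT" * 10, "+", "I" * 40,
]
toy_counts = read_lengths(toy)
print(sorted(toy_counts.items()))
print(summarize(toy_counts))

# For a real file (hypothetical name):
# with open("sample.fastq") as fh:
#     print(summarize(read_lengths(fh)))
```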
In the FastQC report on overrepresented sequences, I found hits for the ABI Solid3 Adapter B, but with different sequence lengths:
- ABI Solid3 Adapter B (100% over 14bp)
- ABI Solid3 Adapter B (95% over 21bp)
- ABI Solid3 Adapter B (100% over 23bp)
Even though all these sequences match the same adapter, their base composition and length vary. This makes me uncertain about how to properly identify and remove adapters during preprocessing.
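One way to check whether these hits are really the same adapter seen at different lengths would be to look for the longest stretch that each overrepresented sequence shares with a candidate adapter. The sketch below uses Python's `difflib` for that. The adapter and read sequences are made-up placeholders, not the real ABI Solid3 Adapter B or Ion Torrent adapter sequences, and only exact matches are considered, so a hit like "95% over 21bp" would show up as a shorter exact stretch.

```python
from difflib import SequenceMatcher


def longest_shared_stretch(seq, adapter):
    """Return (shared substring, start in seq, start in adapter)."""
    matcher = SequenceMatcher(None, seq, adapter, autojunk=False)
    m = matcher.find_longest_match(0, len(seq), 0, len(adapter))
    return seq[m.a:m.a + m.size], m.a, m.b


# Placeholder sequences for illustration only.
toy_adapter = "GATCCTTAGCAGTCAAGGCTCTA"
toy_overrepresented = [
    "TGAGGTAGTAGGTTGTATAGTT" + toy_adapter[:14],  # insert + first 14 adapter bases
    "TTGAGGTAGT" + toy_adapter,                   # shorter insert + full adapter
]

for seq in toy_overrepresented:
    shared, seq_start, adapter_start = longest_shared_stretch(seq, toy_adapter)
    print(len(shared), "bases shared, starting at read position", seq_start,
          "and adapter position", adapter_start)
```

If every hit starts at adapter position 0 and only the read position changes, that would suggest the same 3' adapter following inserts of different lengths, rather than different adapters.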
My questions are:
- Is it correct to use CROP in this case, or should I first remove adapters with Cutadapt and then trim low-quality sequences with Trimmomatic?
- Is it better to use SLIDINGWINDOW instead of AVGQUAL for quality filtering?
- How can I correctly determine which adapters to remove if each sample has a different overrepresented sequence?
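To make the first question more concrete, here is a toy comparison of a fixed crop against adapter-aware trimming. It is a simplified sketch and not how Cutadapt or Trimmomatic are implemented: matching is exact only (real tools tolerate mismatches), the `min_overlap` parameter is my own illustrative name, and the insert and adapter sequences are made up.

```python
def crop(read, length=35):
    """Keep at most `length` bases from the 5' end, as a fixed crop would."""
    return read[:length]


def trim_3prime_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter and everything after it (exact matches only)."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Otherwise look for a partial adapter at the read end: the longest
    # adapter prefix (at least min_overlap bases) that is a read suffix.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Made-up sequences: a 22-base insert followed by a 23-base adapter.
toy_insert = "TGAGGTAGTAGGTTGTATAGTT"
toy_adapter = "GATCCTTAGCAGTCAAGGCTCTA"
toy_read = toy_insert + toy_adapter + "ACGTA"

cropped = crop(toy_read)
trimmed = trim_3prime_adapter(toy_read, toy_adapter)
print(len(cropped), cropped[len(toy_insert):])  # adapter bases left after cropping
print(len(trimmed), trimmed == toy_insert)
```

In this toy case the crop to 35 bases leaves 13 adapter bases attached to the 22-base insert, while the adapter-aware trim recovers the insert itself, which is why I suspect adapter removal should come before any length-based step.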
I think I'm doing a lot of things wrong :( but I try not to get overwhelmed haha. I would really appreciate any guidance on these points, as I want to ensure proper preprocessing before moving on to further analysis.
GenoMax, thank you for your answer. You are correct that the data comes from the second paper you mentioned, but it was sequenced on Ion Torrent, not Illumina.
I have searched the Thermo Fisher website and the user guide for the kit, but I could not find the actual adapter sequences; they only refer to the adapters by name (P1 and A). The only clear information I have found is that the Ion Torrent software (Torrent Suite) is very effective at removing adapters during sequencing, which makes me think that, in theory, I should not be seeing adapters at all. However, the article does not mention whether they used this software.