I am performing a ribosome profiling analysis. After preprocessing the data, I aligned the reads with STAR (version 2.7.9) using the following parameters:
STAR \
--runThreadN 16 \
--genomeDir ${GenomeIndex} \
--readFilesIn ${sample} \
--outFileNamePrefix "${output_file_prefix}" \
--outSAMtype BAM SortedByCoordinate \
--outSAMattrRGline ID:${root_name} \
--sjdbGTFfile ${gtf} \
--outFilterMismatchNoverLmax 0.1 \
--outSAMmultNmax 1 \
--outMultimapperOrder Random \
--twopassMode Basic \
--seedSearchStartLmax 15 \
--seedSearchLmax 15 \
--quantMode TranscriptomeSAM GeneCounts
After alignment, I performed deduplication using UMI-tools.
Unexpectedly, the number of deduplicated reads aligned to the transcriptome (18,604,132 in one sample) was much higher than the number aligned to the genome (7,162,360 in the same sample).
Initially, I suspected that transcript isoforms could be causing this discrepancy. However, after isoform selection, the number of deduplicated reads aligned to the transcriptome (8,539,841) was still higher than the number mapped to the genome.
I would appreciate any insights on why this difference might be occurring and how to interpret these results correctly.
Can you clarify what you mean by "isoform selection"?
You might also check how many alignments there are to each (genome and transcriptome) before deduplication. Also look at how many *reads* there are, rather than how many alignments. To do this, name sort the BAM, extract the read names, and pipe them through
uniq | wc -l
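To make the read-versus-alignment comparison concrete, here is a minimal sketch of that count. It assumes SAM-format text on stdin and uses awk to track read names it has already seen, so no name sort is needed (a swap for the sort-then-uniq approach, at the cost of holding every name in memory). The function name count_reads and the BAM file names in the comments are placeholders, not something from your pipeline.

```shell
# Count alignment records and distinct read names in SAM text read from stdin.
# Header lines start with '@' and are skipped; QNAME is the first tab-separated field.
count_reads() {
    awk -F'\t' '
        !/^@/ {
            n++                                        # one alignment record
            if (!($1 in seen)) { seen[$1] = 1; u++ }   # first time this read name appears
        }
        END { printf "alignments=%d reads=%d\n", n, u }
    '
}

# Hypothetical usage, assuming samtools is installed and these are your files
# (-F 4 drops unmapped records):
#   samtools view -F 4 genome.dedup.bam        | count_reads
#   samtools view -F 4 transcriptome.dedup.bam | count_reads
```

If the two files report a similar number of reads but the transcriptome file has many more alignments, that would suggest the same read is being counted several times there (for example once per transcript it overlaps) rather than more reads actually mapping.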