Question

Higher number of reads mapped to transcriptome vs genome in STAR mapping

2 days ago

paguirreazorin • 0

I am performing a ribosome profiling analysis. After processing the data, I aligned the reads using STAR (version 2.7.9) with the following parameters:

STAR \
        --runThreadN 16 \
        --genomeDir ${GenomeIndex} \
        --readFilesIn ${sample} \
        --outFileNamePrefix "${output_file_prefix}" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattrRGline ID:${root_name} \
        --sjdbGTFfile ${gtf} \
        --outFilterMismatchNoverLmax 0.1 \
        --outSAMmultNmax 1 \
        --outMultimapperOrder Random \
        --twopassMode Basic \
        --seedSearchStartLmax 15 \
        --seedSearchLmax 15 \
        --quantMode TranscriptomeSAM GeneCounts

After alignment, I performed deduplication using UMI-tools.

Unexpectedly, I found that the number of deduplicated reads aligned to the transcriptome was much higher (18,604,132 in a sample) than the number of reads aligned to the genome (7,162,360 in the same sample).

Initially, I suspected that transcript isoforms could be causing this discrepancy. However, after isoform selection, the number of deduplicated reads aligned to the transcriptome (8,539,841) was still higher than those mapped to the genome.

I would appreciate any insights on why this difference might be occurring and how to interpret these results correctly.

riboseq star bam transcriptome alignment • 182 views

ADD COMMENT • link updated 2 days ago by i.sudbery 21k • written 2 days ago by paguirreazorin • 0

Can you clarify what you mean by "isoform selection"?

ADD REPLY • link 2 days ago by i.sudbery 21k

You might also check how many alignments there are to each (genome and transcriptome) before deduplication. It is also worth looking at how many *reads* there are, rather than how many alignments: to do this, name sort, extract the read names, and run them through uniq | wc -l.
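As a rough sketch of that check (not part of the original comment), the function below counts both alignments and distinct read names from SAM-formatted text on stdin. The function name count_reads_and_alignments is hypothetical. It tracks names in an awk array rather than name sorting, which gives the same count of distinct reads.

```shell
# Count alignments and distinct read names in SAM text read from stdin.
# Header lines (starting with '@') are skipped.
count_reads_and_alignments() {
    awk -F'\t' '
        /^@/ { next }                      # skip SAM header lines
        {
            alignments++                   # every record is one alignment
            if (!($1 in seen)) {           # column 1 is the read name
                seen[$1] = 1
                reads++
            }
        }
        END { printf "alignments=%d reads=%d\n", alignments, reads }
    '
}
```

Assuming samtools is installed and STAR wrote its usual output files under the chosen prefix, it could be used like this, once per BAM:

samtools view -F 4 "${output_file_prefix}Aligned.sortedByCoord.out.bam" | count_reads_and_alignments
samtools view -F 4 "${output_file_prefix}Aligned.toTranscriptome.out.bam" | count_reads_and_alignments

If the transcriptome BAM shows many more alignments than reads, the extra counts come from reads reported on several transcripts rather than from extra reads.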

ADD REPLY • link 2 days ago by i.sudbery 21k

score 0 · Answer 1 · 2025-03-20

I can think of one possibility.

Consider:

You aligned reads to a transcriptome that contains two isoforms of a gene. Reads that come from the part of the transcript shared between the two isoforms will be mapped twice, once to each isoform.

When you come to do deduplication, UMI-tools uses all reads mapped to a location to decide which reads are duplicates and which aren't. It builds networks of reads whose UMIs are related to each other by 1 edit, and then collapses the whole network to a single UMI. Thus, UMIs that differ by 2 bases can be collapsed into each other if the intermediate UMI, which is 1 base different from each, is present.

It's possible that in the transcriptome alignment, that intermediate read sometimes doesn't map to one of the transcripts, but does map to the other (and to the genome). In that case, three reads that would be collapsed to 1 on the genome are collapsed to 1 on one transcript and left as 2 on the other.

My general advice, if you are aligning to the transcriptome and then deduplicating, is to collapse isoforms before alignment.
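The effect of a missing intermediate UMI can be illustrated with a small sketch. This is a simplified model of the network idea described above, not UMI-tools' actual code: UMIs at one position are linked when they differ by exactly one base, and each connected network counts as one deduplicated read. The function name count_umi_groups is hypothetical, and real UMI-tools methods (such as its default, directional) also take UMI counts into account.

```shell
# Count UMI networks at a single mapping position.
# Input: one UMI per line on stdin, all of the same length.
# UMIs differing by exactly one base are linked; each connected
# network is collapsed to a single group.
count_umi_groups() {
    awk '
        # Hamming distance between two equal-length strings
        function hamming(a, b,    i, d) {
            d = 0
            for (i = 1; i <= length(a); i++)
                if (substr(a, i, 1) != substr(b, i, 1)) d++
            return d
        }
        # Union-find lookup: follow parent links to the network root
        function find(x) {
            while (parent[x] != x) x = parent[x]
            return x
        }
        # Register each distinct UMI once
        NF && !($1 in parent) { parent[$1] = $1; umi[++n] = $1 }
        END {
            # Merge the networks of every pair of UMIs one edit apart
            for (i = 1; i < n; i++)
                for (j = i + 1; j <= n; j++)
                    if (hamming(umi[i], umi[j]) == 1)
                        parent[find(umi[i])] = find(umi[j])
            # Each remaining root is one network
            groups = 0
            for (i = 1; i <= n; i++)
                if (find(umi[i]) == umi[i]) groups++
            print groups
        }
    '
}
```

With the intermediate UMI present, as on the genome, printf 'AAAA\nAAAT\nAATT\n' | count_umi_groups prints 1. Without it, as on a transcript the intermediate read did not map to, printf 'AAAA\nAATT\n' | count_umi_groups prints 2.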