Higher number of reads mapped to transcriptome vs genome in STAR mapping
1 day ago

I am performing a ribosome profiling analysis. After processing the data, I aligned the reads using STAR (version 2.7.9) with the following parameters:

STAR \
        --runThreadN 16 \
        --genomeDir ${GenomeIndex} \
        --readFilesIn ${sample} \
        --outFileNamePrefix "${output_file_prefix}" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattrRGline ID:${root_name} \
        --sjdbGTFfile ${gtf} \
        --outFilterMismatchNoverLmax 0.1 \
        --outSAMmultNmax 1 \
        --outMultimapperOrder Random \
        --twopassMode Basic \
        --seedSearchStartLmax 15 \
        --seedSearchLmax 15 \
        --quantMode TranscriptomeSAM GeneCounts

After alignment, I performed deduplication using UMI-tools.

Unexpectedly, I found that the number of deduplicated reads aligned to the transcriptome was much higher (18,604,132 in a sample) than the number of reads aligned to the genome (7,162,360 in the same sample).

Initially, I suspected that transcript isoforms could be causing this discrepancy. However, after isoform selection, the number of deduplicated reads aligned to the transcriptome (8,539,841) was still higher than those mapped to the genome.

I would appreciate any insights on why this difference might be occurring and how to interpret these results correctly.

riboseq star bam transcriptome alignment • 162 views

Can you clarify what you mean by "isoform selection"?


You might also try checking how many alignments there are to each (genome and transcriptome) before deduplication. It is also worth looking at how many *reads* there are, rather than how many alignments: to do this, name sort, then extract the read names, and do a uniq | wc -l.
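
The reads-versus-alignments check suggested above could look roughly like this. It is a minimal sketch, assuming SAM records arrive on standard input (for example from `samtools view`); the two function names and the BAM file names in the comments are illustrative and do not come from the original post. Because `sort -u` sorts the names itself, a separate name sort of the BAM is not needed here.

```shell
#!/usr/bin/env bash
# Count alignment records vs. distinct read names in SAM text on stdin.
# Both functions are hypothetical helpers, not part of samtools or STAR.

count_alignments() {
    # Every non-header line is one alignment record.
    grep -vc '^@'
}

count_unique_reads() {
    # Column 1 is the read name (QNAME); sort -u keeps one copy per name.
    grep -v '^@' | cut -f1 | sort -u | wc -l
}

# Example usage (file names are assumptions, -F 4 drops unmapped reads):
#   samtools view -F 4 Aligned.sortedByCoord.out.bam   | count_unique_reads
#   samtools view -F 4 Aligned.toTranscriptome.out.bam | count_unique_reads
```

If the transcriptome BAM has many more alignments than unique read names, the excess comes from reads reported against several transcripts rather than from extra reads.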

1 day ago

I can think of one possibility.

Consider:

You aligned reads to a transcriptome that contains two isoforms for a gene. Reads that come from the part of the transcript shared between the two isoforms will be mapped twice.

When you come to do deduplication, UMI-tools uses all reads mapped to a location to decide which reads are duplicates and which aren't. It builds networks of reads whose UMIs are related to each other by 1 edit, and then collapses the whole network to a single UMI. Thus, UMIs that differ by 2 bases can be collapsed into each other if the intermediate UMI, which is 1 base different from each, is present.

It's possible that in the transcriptome alignment, sometimes that intermediate read doesn't map to one of the transcripts, but does map to the other (and to the genome), so that where three reads would be collapsed to 1 on the genome, they are collapsed to 2 on that transcript.
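
To make the network argument concrete, here is a simplified sketch of it. It treats the UMIs at one position as a graph, joins any two that differ by one base, and counts the resulting networks. The helper `count_umi_networks` and the example UMIs are made up for illustration. This is not how UMI-tools is implemented: its default `directional` method also applies a count-based condition before joining two UMIs, which is left out here.

```shell
#!/usr/bin/env bash
# Simplified illustration: group the UMIs seen at one position into networks
# whose neighbours differ by one base, then count the networks.
# count_umi_networks is a hypothetical helper, not a UMI-tools command.

count_umi_networks() {
    # Reads one UMI per line on stdin; prints the number of networks.
    awk '
    function hamming(a, b,    i, d) {
        if (length(a) != length(b)) return 99
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        return d
    }
    function find(x) {
        # Follow parent links to the representative of x
        while (parent[x] != x) x = parent[x]
        return x
    }
    # Register each distinct UMI once
    NF && !($1 in parent) { parent[$1] = $1; umi[++n] = $1 }
    END {
        # Join every pair of UMIs that are one mismatch apart
        for (i = 1; i <= n; i++)
            for (j = i + 1; j <= n; j++)
                if (hamming(umi[i], umi[j]) == 1)
                    parent[find(umi[i])] = find(umi[j])
        for (i = 1; i <= n; i++)
            if (find(umi[i]) == umi[i]) groups++
        print groups + 0
    }'
}

# Genome: all three UMIs are present, so AAAA-AAAT-AATT chain into 1 network.
printf 'AAAA\nAAAT\nAATT\n' | count_umi_networks
# One transcript: the intermediate AAAT is missing, so 2 networks remain.
printf 'AAAA\nAATT\n' | count_umi_networks
```

The first call prints 1 and the second prints 2, which is the "three reads collapse to 1 on the genome but to 2 on the transcriptome" situation described above.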

My general advice, if you are aligning to the transcriptome and then deduplicating, is to collapse isoforms before alignment.
