Question

How does kallisto handle multi-mapped reads?

1

10 months ago

bioinfo ▴ 150

Hello,

I am aligning my data with kallisto to a reference transcriptome and then assigning gene counts using tximport and biomaRt. I am trying to understand how kallisto and tximport handle multi-mapped reads.

Are multi-mapped reads discarded, or are fractions of counts added to the transcripts they map to?

Thank you

kallisto • 1.0k views

ADD COMMENT • link updated 10 months ago by dsull ★ 7.0k • written 10 months ago by bioinfo ▴ 150

0

I don't think tximport has anything to do with reads whatsoever. Relevant GitHub issue: https://github.com/pachterlab/kallistobustools/issues/15

ADD REPLY • link 10 months ago by Ram 44k

0

Thank you for the link. I had found this before, but I am not sure whether kallisto and kallisto bustools handle multi-mapped reads the same way. Do you know if they do?

ADD REPLY • link 10 months ago by bioinfo ▴ 150

0

Are you using kallisto for 10x Genomics single-cell RNA-seq? I ask because you mentioned "bustools". If so, kallisto (with bustools), by default, discards all reads that map to more than one gene (this is the same approach taken by other software such as Cell Ranger).

If you're using tximport, that means you're interested in bulk RNA-seq, in which case kallisto does indeed perform fractional count assignment using an EM algorithm (as ATpoint mentioned). I can go into it more if you're interested.

ADD REPLY • link 10 months ago by dsull ★ 7.0k

0

I don't think they are dealing with scRNA-seq data. I picked a slightly off-topic issue accidentally.

ADD REPLY • link 10 months ago by Ram 44k

0

Thank you so much for replying. I am using kallisto (without bustools) for bulk RNA-seq. Would you mind explaining in more detail how kallisto does the fractional count assignment?

ADD REPLY • link 10 months ago by bioinfo ▴ 150

2

Let's say you have exactly four reads in your dataset: all four reads map to transcript A, while some of the reads also map to transcripts B and/or C.

When you run the EM algorithm, transcript A will get the most "fractional counts", while transcripts B and C will still get some (but far fewer). This is because the EM algorithm gives you probability estimates (i.e., the probability of selecting a read from tx A, from tx B, from tx C, etc.), and those probability estimates (which sum to 1) are multiplied by the number of mapped reads in your dataset. Remember, kallisto is a probabilistic algorithm: it's doing something a bit more intelligent than simply dividing up the counts evenly. See this link for more explanation, or pages 16-17 of this paper.
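To make the "probability estimates times number of mapped reads" idea concrete, here is a minimal Python sketch of a generic EM loop. It is not kallisto's actual code, and it rests on simplifying assumptions: every transcript has the same effective length (kallisto's model accounts for length), and the toy data is a slightly different four-read example with only two transcripts, chosen so the answer can be checked by hand. The function name `em_counts` and the read sets are made up for illustration.

```python
def em_counts(read_compat, transcripts, n_iter=100):
    """Toy EM: estimate fractional counts from read-to-transcript compatibility.

    read_compat: one set per read, holding the transcripts that read maps to.
    Assumes equal effective lengths for all transcripts (a simplification).
    """
    n_reads = len(read_compat)
    # Start from uniform probabilities over transcripts.
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read across its compatible transcripts,
        # in proportion to the current probability estimates.
        expected = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            total = sum(alpha[t] for t in compat)
            for t in compat:
                expected[t] += alpha[t] / total
        # M-step: new probabilities are the normalized expected counts.
        alpha = {t: expected[t] / n_reads for t in transcripts}
    # Probabilities (summing to 1) times the number of mapped reads.
    return {t: alpha[t] * n_reads for t in transcripts}


# Hypothetical toy data: two reads unique to A, one unique to B,
# and one multi-mapped read compatible with both.
reads = [{"A"}, {"A"}, {"A", "B"}, {"B"}]
counts = em_counts(reads, ["A", "B"])
print(counts)  # A is about 2.667, B is about 1.333
```

Splitting the multi-mapped read evenly would give A = 2.5 and B = 1.5. The EM loop instead gives A about two thirds of that read, because the uniquely mapped reads suggest A is roughly twice as abundant as B.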

ADD REPLY • link 10 months ago by dsull ★ 7.0k

0

dsull is an active kallisto developer and might give you a good explanation of exactly how the EM algorithm within kallisto works. As for tximport, it does not know about "reads". What it does is take the transcript-level counts and sum them to the gene level. It takes whatever the upstream quantifier/aligner gives it in terms of counts (including however multimappers were handled).
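As a rough illustration of that summation step (in Python, so this is not tximport's R interface), the sketch below adds up transcript-level estimated counts per gene using a transcript-to-gene table. The function name `summarize_to_gene`, the IDs, and the numbers are all hypothetical.

```python
from collections import defaultdict


def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        # Each transcript contributes its (possibly fractional) count
        # to the gene it belongs to.
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)


# Hypothetical fractional counts as a quantifier might report them.
tx_counts = {"tx1": 2.5, "tx2": 1.5, "tx3": 4.0}
tx2gene = {"tx1": "geneX", "tx2": "geneX", "tx3": "geneY"}
gene_counts = summarize_to_gene(tx_counts, tx2gene)
print(gene_counts)  # {'geneX': 4.0, 'geneY': 4.0}
```

Note that a read shared between tx1 and tx2 ends up entirely in geneX regardless of how it was split, so ambiguity between isoforms of the same gene disappears at the gene level.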

ADD REPLY • link 10 months ago by ATpoint 86k