How to choose parameters for kallisto single end mode?
1
0
Entering edit mode
2.1 years ago
bioinfo ▴ 150

Hello,

I have some fastq files produced by NextSeq. The average fragment length is between 350-400bp. The library prep adds 139bp of adapters, so the inserts are 200-250bp.

Would the kallisto command look like this?

kallisto quant -i index -o output --single -l 350 -s 50 R1.fastq.gz --rf-stranded

Thank you

kallisto RNA-seq • 4.0k views
ADD COMMENT
0
Entering edit mode

my hunch is that the fragment length only matters when the run is in paired-end mode - in single-end mode I don't see how that would have any effect whatsoever

ADD REPLY
0
Entering edit mode

The kallisto manual says that you need to specify both -l and -s for single end mode. I am just not sure if the values I chose are correct.

ADD REPLY
1
Entering edit mode
2.1 years ago

I made a comment on this post, but that turns out to be not quite right.

I was told via other sources that Kallisto will apply a correction based on the fragment length, basically ignoring alignments that appear to not "fit" into the transcript when the whole fragment is considered.

If so, the value should reflect the original fragment length corresponding to the biologically relevant template and not the artificial construct with ligated adapters.

ADD COMMENT
1
Entering edit mode

TPM normalized counts (which is what kallisto outputs) requires fragment length information. This is because, in TPM, counts are divided by the "effective length" rather than the actual transcript length. The effective length is the number of positions a fragment can start along a given transcript.

This is the primary reason kallisto requires fragment length information (for paired-end data, fragment length information is inferred automatically from your reads).

ADD REPLY
0
Entering edit mode

I always thought the effective length is computed based on read lengths (subtracting half the read length from each end).

As you and others pointed out, there is a second effective size correction on the 3' end based on fragment size -

I think it is a correction that could have a more substantial effect in some circumstances, for example, when the transcripts are incompletely characterized - which is very common for many organisms: we know the coding regions but not the full transcripts. In those cases, it could be better to "lie" about the fragment size and claim it to be either shorter or have a larger standard deviation - that way it would allow all the data to be used.

ADD REPLY
0
Entering edit mode

Effective length equals to transcript length minus the insert size + 1.

If transcripts don't get fragmented, then it's equal to transcript length minus read length + 1.

Reference: Pachter's 2011 Arxiv paper.

ADD REPLY
0
Entering edit mode

Thank you. Does that mean that I should use -l 225 -s 25 as parameters?

ADD REPLY
0
Entering edit mode

Yes, if you are sure that that is the insert length (the adapters should not be counted).

ADD REPLY
0
Entering edit mode

Does this mean that adapter trimming should be performed before running Kallisto single-end mode?

ADD REPLY
0
Entering edit mode

No. You, the user, supply the fragment length distribution numbers; and trimming adapters will not affect what position in a reference transcript a read gets mapped to. For a given -l and -s, you'll likely end up with the same TPMs with and without adapter trimming.

ADD REPLY
0
Entering edit mode

Thank you so much sdull! I'm still just a little confused on why you're recommending the insert size parameters (-l 225 -s 25) over the fragment size parameters (-l 350 -s 50) if it doesn't matter if the adapters are included or not. (I've read your comments like 10 times and it's just not sticking sorry)

ADD REPLY
0
Entering edit mode

Because reads can only originate from the insert, not the adapters.

If you have an adapter that's a million base pairs long, you're not going to set -l to one million right? (when the reads can only only originate from a few hundred bp's within the fragment).

(P.S. All that being said, if I'm to be honest, these fragment models [which originated from the cufflinks days] are very poor models anyway so honestly just be semi-arbitrary with it -- it probably won't make much a difference. I might write a paper on their shortcomings at some point.)

ADD REPLY
1
Entering edit mode

Additional topics to the "shortcomings" paper:

  1. What's up with the mapping quality in BAM files in general?
  2. Why would a multi-mapped read get a zero mapping quality in particular?
ADD REPLY

Login before adding your answer.

Traffic: 2189 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6