kallisto: strand-specific and fragment length calculation
7.6 years ago
user230613 ▴ 380

Hi!

I'm starting to use kallisto to do transcript-level expression quantification. I have some questions:

1) Does kallisto infer the strandedness of the input data just like salmon does (--libType A)? I guess the answer is no.

2) On the other hand, kallisto has the following two options:

--fr-stranded             Strand specific reads, first read forward
--rf-stranded             Strand specific reads, first read reverse

Do these options only work for PE data?

3) Regarding the fragment length estimation when using SE datasets:

-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)

What does DOUBLE mean? Do we have to specify double the calculated value?

Thank you in advance

RNA-Seq kallisto • 10k views
7.6 years ago
  1. No
  2. They should work for SE data too (never tried, though). You probably want --rf-stranded for anything remotely recent.
  3. An example of a double is 200.0 or 123.4. That is, any number with a decimal point. The documentation there should really be changed, since I don't expect those without C/C++/etc. programming experience to know that "double" means "double precision floating point value" (or what that even means).
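
To make point 3 concrete: "DOUBLE" names the type of the value, not an instruction to multiply it by two. Below is a minimal Python sketch of that idea. The helper name `parse_length` is made up for illustration and is not part of kallisto; it only mimics, roughly, how a decimal-number option is read.

```python
def parse_length(text: str) -> float:
    """Read a command-line value as a decimal number (hypothetical helper)."""
    # float() accepts both "200" and "123.4"; nothing is doubled.
    return float(text)

print(parse_length("200"))    # 200.0
print(parse_length("123.4"))  # 123.4
```

So `-l 200` and `-l 200.0` describe the same fragment length.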

Just to add to the answer, there is an option for SE data (--single).

Sorry, I have another question: "fragment-length" is not the same as read length, is it? I mean, it can't be inferred from the input SE FASTQ files.

Correct, fragment length refers to the length of the fragments loaded onto the sequencer. If this is your own dataset, then either you or whoever did the sequencing should know this (it can be estimated from a bioanalyzer plot). If this is a public dataset, then hopefully the value is written down somewhere.
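
If the size distribution is available as binned values (for example, read off a bioanalyzer trace), the two numbers kallisto asks for are its mean and standard deviation. Here is a minimal sketch, assuming bins of (fragment size in bp, relative signal). The function name and the example bins are made up for illustration and are not measurements from any real library.

```python
import math

def fragment_stats(bins):
    """Return (mean, sd) of a binned fragment-size distribution.

    bins: iterable of (size_bp, weight) pairs with non-negative weights.
    """
    bins = list(bins)
    total = sum(weight for _, weight in bins)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of the fragment sizes.
    mean = sum(size * weight for size, weight in bins) / total
    # Weighted (population) variance around that mean.
    variance = sum(weight * (size - mean) ** 2 for size, weight in bins) / total
    return mean, math.sqrt(variance)

# Made-up example bins, for illustration only.
example = [(150, 1.0), (200, 2.0), (250, 1.0)]
mean, sd = fragment_stats(example)
print(f"-l {mean:.1f} -s {sd:.1f}")  # -l 200.0 -s 35.4
```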

Hello

Sorry, I am a little confused by you saying that --rf-stranded is most likely the most appropriate option. For SE data, wouldn't you want to only process reads that align to the forward strand of the transcript?

Or have I made an error here?

It doesn't matter whether you sequence SE or PE; for recent (since ~2013) data, read #1 in a pair aligns with the opposite orientation of the originating fragment. In a parlance that many prefer, read #1 should align to the opposite strand of the transcript/gene.

By originating fragment, do you mean the transcriptome or genome sequences?

Either way. If you align to the transcriptome, then read #1 should always be aligned as its reverse complement.

One more question.

The RSeQC package outputting "1+-,1-+,2++,2--" basically means that read #2 'sets' the strand, since it aligns to the same strand as the transcript/gene. Thus, read #1 aligns to the opposite strand of the transcript/gene (i.e. it is reverse-complemented).

For this library type (apparently the most common nowadays), parameter --rf-stranded should be the one to use in 'kallisto quant' for abundance estimation using a reference transcriptome. Is that right?

The link below has confused me in this respect, and I just wanted to be sure:

https://github.com/griffithlab/rnaseq_tutorial/blob/master/manuscript/supplementary_tables/supplementary_table_5.md

Correct, TruSeq is the most common and it's --rf-stranded (if that's wrong, you'll be able to tell from the terrible quantitation metrics).
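
The mapping discussed in this sub-thread can be sketched in Python. The function name, the 0.8 cutoff, and the example fractions are assumptions for illustration only. The two pattern strings in the docstring are the ones RSeQC reports for paired-end data, as quoted in the comment above.

```python
from typing import Optional

def kallisto_strand_flag(frac_same: float, frac_opposite: float,
                         cutoff: float = 0.8) -> Optional[str]:
    """Suggest a kallisto strand option from two RSeQC fractions.

    frac_same:     fraction explained by "1++,1--,2+-,2-+"
                   (read #1 on the same strand as the transcript)
    frac_opposite: fraction explained by "1+-,1-+,2++,2--"
                   (read #1 on the opposite strand)
    cutoff:        hypothetical threshold for calling a library stranded
    """
    if frac_opposite >= cutoff:
        return "--rf-stranded"
    if frac_same >= cutoff:
        return "--fr-stranded"
    return None  # looks unstranded: pass neither option

# Made-up fractions resembling a TruSeq-style stranded library.
print(kallisto_strand_flag(0.02, 0.96))  # --rf-stranded
```

A result of `None` here means neither fraction dominates, in which case neither strand option would be passed.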

7.5 years ago
pmelsted ▴ 130
  1. No. If you are unsure, I would recommend running BLAT on a few reads to see which strand they map to. Also, if you happen to choose the wrong option, you'll have significantly fewer reads mapping.

  2. This works for SE and PE data.

  3. As pointed out by Devon, this means that it accepts a floating point value or an integer. The -l and -s parameters are required for SE data and refer to the fragment length distribution; for PE data they can be estimated from the paired reads. Typical values for RNA-Seq are -l 200 and -s 30.
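
Putting the pieces of this thread together, a single-end run needs `--single`, `-l` and `-s`, plus a strand option if the library is stranded. The sketch below only builds the argument list and does not run kallisto. The function name and file names are hypothetical, and the defaults of 200 and 30 are just the typical values mentioned above, not values suited to every library.

```python
def kallisto_quant_single_cmd(index, out_dir, fastq,
                              fragment_length=200.0, sd=30.0, strand=None):
    """Build (but do not run) a `kallisto quant` argument list for SE reads."""
    if strand not in (None, "--fr-stranded", "--rf-stranded"):
        raise ValueError("strand must be None, --fr-stranded or --rf-stranded")
    cmd = ["kallisto", "quant",
           "-i", index,                  # kallisto index file
           "-o", out_dir,                # output directory
           "--single",                   # single-end mode
           "-l", str(fragment_length),   # mean fragment length
           "-s", str(sd)]                # sd of fragment length
    if strand:
        cmd.append(strand)
    cmd.append(fastq)
    return cmd

# Hypothetical file names, shown only to illustrate the shape of the command.
print(" ".join(kallisto_quant_single_cmd(
    "transcripts.idx", "quant_out", "reads.fastq.gz", strand="--rf-stranded")))
```

The returned list could be handed to `subprocess.run` once the paths and values match your own data.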
