Question

Kallisto read count processed does not tally

0

Entering edit mode

6.8 years ago

solyris83 ▴ 20

Hi,

I am running Kallisto quant with the following command

kallisto quant -i gencodeV27 -o sample1 --bias --rf-stranded --genomebam --gtf gencode.v27lift37.annotation.gtf --chromosomes chromSize.txt sample1_1.fq sample1_2.fq

And one particular summary line below caught my attention which says (correct me if I am wrong) that there is a total of 71,414,042 paired reads in my fastq files

[quant] processed 71,414,042 reads, 65,645,163 reads pseudoaligned

The processed reads number does not tally against the reads count I get from trim_galore. I am using trimmed fastq files for this testing and the summary reported number from trim_galore after processing is found below, which means that I have only 65,853,559 (= 66089021 - 235462) reads in my input fastq file for kallisto.

RUN STATISTICS FOR INPUT FILE: sample1.fastq.gz
=============================================
66089021 sequences processed in total

Total number of sequences analysed for the sequence pair length validation: 66089021

Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 235462 (0.36%)

Any comment on my interpretation above would be much appreciated. Or TLDR, why is Kallisto processing more reads than what is in the input fastq?

Kallisto trim_galore • 3.9k views

ADD COMMENT • link 6.8 years ago by solyris83 ▴ 20

0

Entering edit mode

The nomenclature is often imprecise.

One of your tools seems to report pairs, the other reads. A pair contains two reads.

ADD REPLY • link 6.8 years ago by Istvan Albert 101k

0

Entering edit mode

Try checking the read counts in your file(s) more directly.

~~grep -e "^@" sample1_1.fq | wc -l~~ (prone to error)

 awk '{s++}END{print s/4}' sample1_1.fq

or for the zipped:

zcat sample1.fastq.gz | awk '{s++}END{print s/4}'

In all my runs, kallisto gives the read count in single-end mode and the pair count in paired-end mode.

ADD REPLY • link 6.8 years ago by mmfansler ▴ 460

1

Entering edit mode

Isn't it dangerous to check the read counts in FASTQ with "^@", given than quality lines may also start with "@" ?

ADD REPLY • link 6.8 years ago by erwan.scaon ▴ 950

0

Entering edit mode

Noted. I fixed it to recommend a less error-prone way. Main point is, though, that the questioner needs to verify the files more directly, because I can't replicate his discrepancy.

ADD REPLY • link 6.8 years ago by mmfansler ▴ 460

score 1 · Answer 1 · 2018-02-21

Hi, as recommended by mmfansler, I did a check on the raw fastq files with the shell statement, and as erwan rightly pointed out a lot more "reads" are counted which are basically sequence quality with first base being @.

The read counting from awk '{s++}END{print s/4}' sample1_1.fq tallies to the result from kallisto now.

This is quite surprising as the sequencing service provider actually provided a reads number generated for each of my sample and this read count tallies exactly with the trim galore number before trimming. And this reported number (~60 million reads pair) from service provider and trim galore is actually smaller than the counted number of lines from the fastq file (~70million reads for both fastq in the pair).

Maybe I need to dig a bit deeper into what filter has been applied before the number is reported from the service provider and maybe trim galore.

Thanks a lot for the help!

score 0 · Answer 2 · 2018-02-22

0

Entering edit mode

6.8 years ago

solyris83 ▴ 20

Hi mmfansler,

Reads are 100bp, I tried counting on the original reads below

Read 1 match on "^@" : 75,681,793 Read 1 count lines divide by 4 : 71,552,293

Read 2 match on "^@" : 77,581,579 Read 2 count lines divide by 4 : 71,552,293

ADD COMMENT • link 6.8 years ago by solyris83 ▴ 20