Question

reads diffreence in fastqc and Tophat input

1

Entering edit mode

8.5 years ago

zz105 ▴ 20

I am doing paired-end RNA-seq analysis, and I noticed that there are 2 files have different # of reads showing in fastqc and tophat summary. There two files also had problems running htseq-count. Does anyone know what is wrong?

Here is align_summary from tophat:

Left reads:
          Input     :   6833369
           Mapped   :   6205653 (90.8% of input)
            of these:    373370 ( 6.0%) have multiple alignments (2442 have >20)
Right reads:
          Input     :   6833369
           Mapped   :   6199565 (90.7% of input)
            of these:    373213 ( 6.0%) have multiple alignments (2441 have >20)

and the summary from fastqc:

 ##FastQC   0.10.1
>>Basic Statistics  pass
#Measure    Value   
Filename    SRR1559100_1.fastq.gz   
File type   Conventional base calls 
Encoding    Sanger / Illumina 1.9   
Total Sequences 88383287    
Filtered Sequences  0   
Sequence length 60-101  
%GC 48

RNA-Seq alignment sequence software error • 1.8k views

ADD COMMENT • link updated 8.4 years ago by Biostar 20 • written 8.5 years ago by zz105 ▴ 20

2

Entering edit mode

Were the reads trimmed at some point? Tophat filters out reads that it considers to be too short, so they won't end up in its final output.

ADD REPLY • link 8.5 years ago by Devon Ryan 105k

0

Entering edit mode

They were trimmed using trimmomatics before running both fastqc and tophat, and reads were dropped if their length were shorter than 60% of the read length

ADD REPLY • link 8.5 years ago by zz105 ▴ 20

1

Entering edit mode

Also, if the reads contain Ns, they might not be counted (not sure, but is a suggestion).

ADD REPLY • link 8.5 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

I counted by grep -c 'N' file and the number does not seem to be close to the difference, but thanks for the suggesiton

ADD REPLY • link 8.5 years ago by zz105 ▴ 20

1

Entering edit mode

Not sure this is the right way to do it. You might grep read names or quality lines as well this way. You should use for example bioawk to print only $seq lines and then grep -c 'N' from that output.

ADD REPLY • link 8.5 years ago by Matteo Schiavinato ★ 3.7k