Entering edit mode
8.0 years ago
zz105
▴
20
I am doing paired-end RNA-seq analysis, and I noticed that there are 2 files have different # of reads showing in fastqc and tophat summary. There two files also had problems running htseq-count. Does anyone know what is wrong?
Here is align_summary from tophat:
Left reads:
Input : 6833369
Mapped : 6205653 (90.8% of input)
of these: 373370 ( 6.0%) have multiple alignments (2442 have >20)
Right reads:
Input : 6833369
Mapped : 6199565 (90.7% of input)
of these: 373213 ( 6.0%) have multiple alignments (2441 have >20)
and the summary from fastqc:
##FastQC 0.10.1
>>Basic Statistics pass
#Measure Value
Filename SRR1559100_1.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 88383287
Filtered Sequences 0
Sequence length 60-101
%GC 48
Were the reads trimmed at some point? Tophat filters out reads that it considers to be too short, so they won't end up in its final output.
They were trimmed using trimmomatics before running both fastqc and tophat, and reads were dropped if their length were shorter than 60% of the read length
Also, if the reads contain Ns, they might not be counted (not sure, but is a suggestion).
I counted by grep -c 'N' file and the number does not seem to be close to the difference, but thanks for the suggesiton
Not sure this is the right way to do it. You might grep read names or quality lines as well this way. You should use for example
bioawk
to print only $seq lines and thengrep -c 'N'
from that output.