Before mapping RNA-seq reads, TopHat always performs a quality filtering step on the reads when preparing them. I would like to know what the cut-off is for discarding a read. It's difficult to find information about this, since most posts about quality and TopHat/Bowtie relate to mapping quality (naturally) and not read quality.
I cannot find much in the TopHat manual, but the Bowtie 2 manual says:
Some reads are skipped or "filtered out" by Bowtie 2. For example, reads may be filtered out because they are extremely short or have a high proportion of ambiguous nucleotides. Bowtie 2 will still print a SAM record for such a read, but no alignment will be reported and the YF:i SAM optional field will be set to indicate the reason the read was filtered.
YF:Z:LN: the read was filtered because it had length less than or equal to the number of seed mismatches set with the -N option.
YF:Z:NS: the read was filtered because it contains a number of ambiguous characters (usually N or .) greater than the ceiling specified with --n-ceil.
YF:Z:SC: the read was filtered because the read length and the match bonus (set with --ma) are such that the read can't possibly earn an alignment score greater than or equal to the threshold set with --score-min.
YF:Z:QC: the read was filtered because it was marked as failing quality control and the user specified the --qc-filter option. This only happens when the input is in Illumina's QSEQ format (i.e. when --qseq is specified) and the last (11th) field of the read's QSEQ record contains 1.
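To see which of these reasons actually occurs in practice, one option is to tally the YF:Z tags in SAM output produced by Bowtie 2. The following is only a minimal sketch: the function name and the two example records are made up for illustration, and it assumes you have access to Bowtie 2's own SAM output rather than TopHat's post-processed files.

```python
from collections import Counter

def count_filter_reasons(sam_lines):
    """Tally YF:Z:<reason> optional fields over SAM alignment records."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:           # optional fields start at column 12
            if tag.startswith("YF:Z:"):
                counts[tag[5:]] += 1      # keep only the reason code, e.g. "NS"
    return counts

# Hypothetical records, invented for this example (not real Bowtie 2 output).
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tNNNNNNNNNN\tIIIIIIIIII\tYT:Z:UU\tYF:Z:NS",
    "r2\t0\tchr1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tAS:i:0\tYT:Z:UU",
]
print(count_filter_reasons(example))
```

On a real file the same function could be fed an open file handle, since it only iterates over lines.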
My reads are definitely longer than the number of seed mismatches, so the first option can be ruled out.
Regarding the second option about ambiguous characters, this is what the manual says about --n-ceil:
Sets a function governing the maximum number of ambiguous characters (usually Ns and/or .s) allowed in a read as a function of read length. For instance, specifying -L,0,0.15 sets the N-ceiling function f to f(x) = 0 + 0.15 * x, where x is the read length. See also: [setting function options]. Reads exceeding this ceiling are [filtered out]. Default: L,0,0.15.
So by default the ceiling is 0.15 * read length. Pretty straightforward. No questions here.
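As a sanity check on that reading, here is a small sketch of the default ceiling f(x) = 0 + 0.15 * x applied to a read. The function names are my own, and the quoted text does not say how Bowtie 2 rounds the ceiling internally, so the comparison below uses the unrounded value as an assumption.

```python
def n_ceiling(read_length, constant=0.0, coefficient=0.15):
    """Linear function L,<constant>,<coefficient>: f(x) = constant + coefficient * x."""
    return constant + coefficient * read_length

def exceeds_n_ceiling(seq, constant=0.0, coefficient=0.15):
    """True if the read has more ambiguous characters (N or .) than the ceiling allows."""
    ambiguous = sum(1 for base in seq.upper() if base in "N.")
    # Assumption: compare against the unrounded ceiling.
    return ambiguous > n_ceiling(len(seq), constant, coefficient)

# A 100 bp read: the default ceiling is 0.15 * 100, i.e. about 15 ambiguous bases.
read_ok = "N" * 14 + "A" * 86
read_bad = "N" * 16 + "A" * 84
print(exceeds_n_ceiling(read_ok), exceeds_n_ceiling(read_bad))
```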
The third option, regarding --ma, I would assume does not apply with default settings, since the default is end-to-end alignment? --ma always equals 0 in this default mode.
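To make the reasoning about the third option concrete, here is a rough sketch of the check the manual describes: the best score a read could earn (match bonus times read length) compared against the --score-min threshold. It assumes Bowtie 2's documented end-to-end defaults (--ma 0 and --score-min L,-0.6,-0.6); the function names are hypothetical, TopHat may pass different values to Bowtie 2, and the real internal check may differ in detail.

```python
def linear(x, constant, coefficient):
    """Bowtie 2 style linear function L,<constant>,<coefficient>."""
    return constant + coefficient * x

def can_reach_min_score(read_length, match_bonus=0, score_min=(-0.6, -0.6)):
    """True if a perfect alignment could meet the minimum-score threshold."""
    best_possible = match_bonus * read_length      # every position matches
    threshold = linear(read_length, *score_min)    # assumed default: L,-0.6,-0.6
    return best_possible >= threshold

# With the assumed end-to-end defaults the threshold is negative,
# so a best possible score of 0 always passes.
print(can_reach_min_score(50))
```

Under these assumptions the SC filter could only trigger if --score-min were set to a function that is positive for the given read length while --ma stays at 0.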
The fourth option applies to users specifying the --qseq option, so for someone like me who uses FASTQ, I would assume it's not relevant.
Does that mean that, with default settings, Bowtie/TopHat only take ambiguous characters into consideration? What about FASTQ base quality/read quality, is this not taken into consideration when filtering?
Appreciate any help and thoughts.