I am trying to tune my tophat2 runs following the example of this blog. They suggest trying different values for the --mate-inner-dist
and --mate-std-dev
.
To get an idea as to the numbers I should use, I ran tophat using :
tophat2 -p 20 -o out/ -a 8 --library-type fr-firststrand --no-coverage-search /path/to/Bowtie2Index/Mus_musculus.GRCm38 R1.fastq.gz R2.fastq.gz
Looking at the bam alignment file, I selected all the Read1 reads (SAM FLAG bit 0x40 set) which were properly aligned (bit 0x2 set) and were not secondary alignments (bit 0x100 _not_ set). The mean of abs(TLEN) is 3400 and the standard deviation of abs(TLEN) is 369034. Double checking the file, there indeed large values for TLEN (e.g. there are values in the 100,000s and millions).
Picking these as my values for the mate-inner-dist
and mate-std-dev
seems like a very poor decision. Clearly there is a tail of reads with very large template lengths that are skewing the distribution (which is _not_ Gaussian). So I have a few of questions:
- Is the template length (TLEN) from tophat computed from the genome or transcriptome?
- What would the best way to select the mate-inner-dist?
- Looking at the
run.log
it appears that bowtie2 uses the default value for --maxins=500 (ie. it is not explicitly changed). Does this matter since I have values of TLEN > 500?
It may be better to switch to HISAT2 (if you want to stay in the same family) at this time.
You can estimate the insert size using BBMap as described here: Estimating Mean Inner Distance
The "best" way would be to look at the Bioanalyzer trace, or whatever your sequencing people used, when checking library size distribution after library prep. The fragment length mean and SD can be estimated from that curve, after subtracting the adapter lengths.