Question

Tophat2 and Bowtie2 : determining appropriate values for mate-inner-dist and mate-std-dev

0

Entering edit mode

8.1 years ago

irritable_phd_syndrome ▴ 130

I am trying to tune my tophat2 runs following the example of this blog. They suggest trying different values for the --mate-inner-dist and --mate-std-dev.

To get an idea as to the numbers I should use, I ran tophat using :

tophat2 -p 20 -o out/ -a 8 --library-type fr-firststrand --no-coverage-search /path/to/Bowtie2Index/Mus_musculus.GRCm38 R1.fastq.gz R2.fastq.gz

Looking at the bam alignment file, I selected all the Read1 reads (SAM FLAG bit 0x40 set) which were properly aligned (bit 0x2 set) and were not secondary alignments (bit 0x100 _not_ set). The mean of abs(TLEN) is 3400 and the standard deviation of abs(TLEN) is 369034. Double checking the file, there indeed large values for TLEN (e.g. there are values in the 100,000s and millions).

Picking these as my values for the mate-inner-dist and mate-std-dev seems like a very poor decision. Clearly there is a tail of reads with very large template lengths that are skewing the distribution (which is _not_ Gaussian). So I have a few of questions:

Is the template length (TLEN) from tophat computed from the genome or transcriptome?
What would the best way to select the mate-inner-dist?
Looking at the run.log it appears that bowtie2 uses the default value for --maxins=500 (ie. it is not explicitly changed). Does this matter since I have values of TLEN > 500?

RNA-Seq alignment • 2.3k views

ADD COMMENT • link updated 8.1 years ago by mra8187 ▴ 20 • written 8.1 years ago by irritable_phd_syndrome ▴ 130

0

Entering edit mode

It may be better to switch to HISAT2 (if you want to stay in the same family) at this time.

You can estimate the insert size using BBMap as described here: Estimating Mean Inner Distance

ADD REPLY • link 8.1 years ago by GenoMax 148k

0

Entering edit mode

The "best" way would be to look at the Bioanalyzer trace, or whatever your sequencing people used, when checking library size distribution after library prep. The fragment length mean and SD can be estimated from that curve, after subtracting the adapter lengths.

ADD REPLY • link 8.0 years ago by apa@stowers ▴ 610

score 0 · Answer 1 · 2017-01-03

you can find out insert size and STD with using BWA :

./bwa index -a bwtsw /path to reference genome

./bwa index /path to reference genome

./bwa aln -o 0 -e 0 /path to bwa indexed reference genome /path to trimmed.file1.fastq.gz > trimmed.file1.fastq.gz"_to_"reference genome.sai

./bwa aln -o 0 -e 0 /path to bwa indexed reference genome /path to trimmed.file2.fastq.gz > trimmed.file2.fastq.gz"_to_"reference genome.sai

./bwa sampe /path to bwa indexed reference genome /path to trimmed.file1.fastq.g "_to_"reference genome.sai /path to trimmed.file2.fastq.gz"_to_"reference genome.sai /path to trimmed.file1.fastq.gz /path to trimmed.file2.fastq.gz > reads_pair_to_ref.sam