Tophat2 and Bowtie2 : determining appropriate values for mate-inner-dist and mate-std-dev
8.0 years ago

I am trying to tune my tophat2 runs following the example of this blog. They suggest trying different values for the --mate-inner-dist and --mate-std-dev.

To get an idea as to the numbers I should use, I ran tophat using :

tophat2 -p 20 -o out/ -a 8 --library-type fr-firststrand --no-coverage-search /path/to/Bowtie2Index/Mus_musculus.GRCm38 R1.fastq.gz R2.fastq.gz

Looking at the BAM alignment file, I selected all the Read1 reads (SAM FLAG bit 0x40 set) which were properly aligned (bit 0x2 set) and were not secondary alignments (bit 0x100 _not_ set). The mean of abs(TLEN) is 3400 and the standard deviation of abs(TLEN) is 369034. Double-checking the file, there are indeed large values for TLEN (e.g. values in the 100,000s and millions).
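For readers who want to reproduce this selection, here is a minimal sketch of the flag test in Python. It assumes the alignments are available as SAM text (for instance, piped from `samtools view`); the function name and the example records are made up for illustration and are not from the original run.

```python
import statistics

def read1_proper_abs_tlen(sam_lines):
    """Collect abs(TLEN) for Read1 records that are properly paired and not secondary."""
    values = []
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])             # column 2: FLAG
        tlen = int(fields[8])             # column 9: TLEN
        is_read1 = flag & 0x40
        is_proper = flag & 0x2
        is_secondary = flag & 0x100
        if is_read1 and is_proper and not is_secondary:
            values.append(abs(tlen))
    return values

# Hypothetical records (illustration only, not real data)
example = [
    "@HD\tVN:1.0",
    "r1\t99\tchr1\t100\t50\t50M\t=\t300\t250\t*\t*",        # read1, proper pair: kept
    "r1\t147\tchr1\t300\t50\t50M\t=\t100\t-250\t*\t*",      # read2: skipped
    "r2\t83\tchr1\t900\t50\t50M\t=\t640\t-310\t*\t*",       # read1, reverse strand, proper: kept
    "r3\t355\tchr2\t10\t1\t50M\t=\t120000\t120000\t*\t*",   # secondary: skipped
    "r4\t65\tchr2\t50\t50\t50M\t=\t500\t500\t*\t*",         # not properly paired: skipped
]

tlens = read1_proper_abs_tlen(example)
print(statistics.mean(tlens), statistics.pstdev(tlens))
```

On a real file the same function could be fed from standard input, e.g. the output of `samtools view` on the BAM produced by the run.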

Picking these as my values for the mate-inner-dist and mate-std-dev seems like a very poor decision. Clearly there is a tail of reads with very large template lengths that is skewing the distribution (which is _not_ Gaussian). So I have a few questions:

  1. Is the template length (TLEN) from tophat computed from the genome or transcriptome?
  2. What would be the best way to select the mate-inner-dist?
  3. Looking at the run.log, it appears that bowtie2 uses the default value for --maxins=500 (i.e. it is not explicitly changed). Does this matter, since I have values of TLEN > 500?
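One way to keep the long tail from dominating the estimate is to use robust statistics: the median instead of the mean, and a scaled median absolute deviation (MAD) instead of the standard deviation. The sketch below is only an illustration under stated assumptions: the `max_plausible` cutoff of 1000 bp is an arbitrary guess at what a real library fragment could be, and the input values are invented. A plausible source of the huge values is mate pairs that span introns, assuming TLEN is reported in genomic coordinates.

```python
import statistics

def robust_fragment_stats(abs_tlens, max_plausible=1000):
    """Return (median, MAD-based SD) of template lengths, ignoring implausible values.

    max_plausible is an assumed cutoff in bp, not a TopHat or Bowtie2 setting.
    """
    kept = [t for t in abs_tlens if 0 < t <= max_plausible]
    med = statistics.median(kept)
    mad = statistics.median([abs(t - med) for t in kept])
    # 1.4826 rescales the MAD so it matches the SD for normally distributed data
    return med, 1.4826 * mad

# Invented values: five ordinary fragments plus two extreme outliers
tlens = [180, 200, 210, 220, 250, 150000, 2000000]
median_len, robust_sd = robust_fragment_stats(tlens)
print(median_len, robust_sd)
print(statistics.mean(tlens))  # the plain mean is dominated by the two outliers
```

The median and MAD describe the template length, which is not yet the inner distance between mates; the read lengths still have to be subtracted before the value is used for --mate-inner-dist.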
RNA-Seq alignment • 2.3k views

It may be better to switch to HISAT2 (if you want to stay in the same family) at this time.

You can estimate the insert size using BBMap as described here: Estimating Mean Inner Distance


The "best" way would be to look at the Bioanalyzer trace, or whatever your sequencing people used, when checking library size distribution after library prep. The fragment length mean and SD can be estimated from that curve, after subtracting the adapter lengths.
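The arithmetic behind this suggestion can be sketched as follows. The TopHat manual defines --mate-inner-dist as the expected distance between the mates, so the two read lengths are subtracted from the fragment length (its example: 300 bp fragments with 50 bp reads give 200). The function name and the library numbers below are hypothetical.

```python
def mate_inner_dist(library_peak_bp, adapter_total_bp, read_length_bp):
    """Inner distance between mates from a library size estimate.

    library_peak_bp  : peak of the size trace, adapters included
    adapter_total_bp : combined length of both adapters
    read_length_bp   : length of each sequenced mate
    """
    fragment = library_peak_bp - adapter_total_bp   # insert without adapters
    return fragment - 2 * read_length_bp            # gap left between the two mates

# Hypothetical library: 420 bp peak, 120 bp of adapters in total, 2 x 100 bp reads
print(mate_inner_dist(420, 120, 100))
```

A negative result means the two mates overlap. The standard deviation read off the trace applies to the fragment length and carries over unchanged, since subtracting constants does not alter the spread.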

7.9 years ago
mra8187 ▴ 20

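You can estimate the insert size and its standard deviation using BWA. While pairing the reads, bwa sampe reports its inferred insert-size mean and standard deviation on stderr: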

./bwa index -a bwtsw /path/to/reference_genome.fa

or, with the default indexing algorithm:

./bwa index /path/to/reference_genome.fa

./bwa aln -o 0 -e 0 /path/to/reference_genome.fa /path/to/trimmed.file1.fastq.gz > trimmed.file1_to_reference_genome.sai

./bwa aln -o 0 -e 0 /path/to/reference_genome.fa /path/to/trimmed.file2.fastq.gz > trimmed.file2_to_reference_genome.sai

./bwa sampe /path/to/reference_genome.fa trimmed.file1_to_reference_genome.sai trimmed.file2_to_reference_genome.sai /path/to/trimmed.file1.fastq.gz /path/to/trimmed.file2.fastq.gz > reads_pair_to_ref.sam
