After browsing through Biostars and Seqanswers, I have a general understanding of how mate inner distance is calculated which is useful for TopHat aligner. Because, to my understanding, fragment length and reads can vary, the mate inner distance will also vary.
I could just use a hard filter and set a minimum length of 200 bp but I am hoping to form a smarter solution. Because I am working with 100+ files, ideally I would like to automatically retrieve fragment and read length (excuse my terminology if incorrect) and then calculate and feed the mate inner distance value to TopHat.
I know that Bowtie (and 2?) can provide such parameters but I am curious how one can extract, calculate, input this.
Here is my command line:
perl ~/Desktop/Applications/prinseq-lite-0.20.4/prinseq-lite.pl -fastq {in_1.fastq} -fastq2 {in_2.fastq} -no_qual_header -trim_qual_left 10 -trim_qual_right 10 -trim_right 1 -custom_params "A 75%;T 75%;G 75%;C 75%" min_qual_mean 25 -min_len 40 -out_format 3 -out_good {out} -out_bad {out_singleton} parallel -j $PARALLEL_TASK --xapply -q tophat -p 10 --mate-inner-dist 200 --mate-std-dev=40 --no-coverage-search --output-dir {placeholder} --transcriptome-index /Volumes/My\ Passport/Documents/hg19/transcriptome /Volumes/My\ Passport/Documents/hg19 {1},{2},{3},{4} ::: {out_good_1} ::: {out_good_2} ::: {out_singleton_1} ::: {out_singleton_2}EDIT: Quoting Arun from here:
In order to use tophat, you'll have to have used the "-r" parameter. I guess you used -r value as 155 as the mean inner distance. However, its equally important to also provide the "--mate-std-dev".
The way I go about figuring these 2 parameters is this: I use BWA and align the paired end reads and obtain PE.sam file. From that, I use picard tools to obtain only those reads that are aligned to the reference genome. From this samfile, I calculate the inner distance of all pairs. From the vector of all such inner distances, I compute the 1st and 3rd quantiles, Q1 and Q3 and then calculate the inter quartile range IQ = Q3-Q1. From here, I take all the values that fall between Q1 - 2 * IQ to Q3 + 2 * IQ. I compute the mean and standard deviation of all these values (this is similar to what BWA does) and provide it to tophat. It seems to work great. Most of my reads are mapped under the given inner distance and standard deviation and the other (few) reads with larger inner distance. There'll always be a few reads that'll have larger inner distance and I consider them outliers and discard in the computation of mean and SD.
Hope this helps.
If I just align my PE with BWA, is there a way to automatically feed it into TopHat?
hi,
unfortunately I can't figure out how to use pipeline or like so to automatically detect the inner distance then I started alignment with tophat with default like below
if I am wrong?
in process I saw
Can I simply use
-r 200
in my case too?library type is HiSeq for Arabidopsis
thank you for any comment
The default is generally OK, you may or may not even get better results by tweaking it.