Dear All,
Can some one help me with the following? Are there any pipelines that one would like to suggest on the other hand that I can refer to?
Using: Tophat2, Bowtie2
- I have multiplexed single end sequencing data with 40bp reads and I would like to perform differential analysis on this.
- I am using the index files from Tophat index & annotation downloads for UCSC mouse mm10
I use the following Tophat command to do the alignment
tophat -p 4 -N 3 --no-coverage-search --read-edit-dist 3 --output-dir <path> /Bowtie2Index/genome <input_path>
I would like to know how can I improve the alignment using tophat. I seem to have the following statistics on alignment
`3208092 reads; of these:`
3208092 (100.00%) were unpaired; of these:
724883 (22.60%) aligned 0 times
1845395 (57.52%) aligned exactly 1 time
637814 (19.88%) aligned >1 times
77.40% overall alignment rate`
This averall alignment percentage varies from 75% to 80% for different samples. Is this normal? I only wonder how to account for this 20% - 25% of reads. Increasing the number of mismatches could be one option but any suggestions on that?
Note: I also checked for alignment against phix controls using Bowtie, which is less than 0.15%(in all samples).
Can you also comment on the number of reads (depth) that is required for a good DE gene analysis for a single sample library (say with 3 bio-replicates and 2 tech replicates for each condition) or size of each sample library?.
Anything else, that you think I am missing out or consider?
Thank you very much!
Thanks a lot!!
An yeah... Sorry it is 2 tech replicates!! My mistake!! Can I extend this question further: I used HTSeq on the TopHat output (bam to sam) for counting features... I had the following kind of stats!
no_feature 782850 ambiguous 5151 too_low_aQual 0 not_aligned 0 alignment_not_unique 2153777
there were 4.1 M features in total... The total number of reads aligned to one of the genes is about 1.2 Million. About 2 million reads not being aligned uniquely is disturbing.
What do you think about this? Does this look normal too?
For comparisons sake, I looked at a dataset that I aligned with STAR. Of the ~36 million (trimmed) reads for one sample, 31.6 million aligned uniquely. The equivalent numbers from HTSeq-count are: no_feature 3160988, ambiguous 526705, alignment_not_unique 5838118.
BTW, by 4.1M features, do you mean 4.1 million reads or that there were 4.1 million features in the annotation file that you fed to htseq-count?
So it looks STAR gives better alignment than TopHat from your numbers!! I shall try STAR then...
4.1M is the total number in HTSeq output (no feature+ ambiguous+too_low_qQual+not_aligned+GeneCountsReported)