I got some interesting output from htseq-count.
My alignment file has 9933765 reads.
My htseq-count output has 60525122 non-unique alignments.
I wonder if someone has a reasonable explanation for why the "non-unique alignments" outnumber my input reads. If not, how can you trust the htseq-count output (e.g. as input for edgeR or DESeq2)?
I put the pipeline, the tophat summary, and the htseq-count output below.
Command Line:
~/Documents/sratoolkit.2.7.0-ubuntu64/bin/fastq-dump -I --split-files -O ./fastq WTrep1_SRR2547503.sra
java -jar ~/Documents/Trimmomatic-0.36/trimmomatic-0.36.jar SE -phred33 ./fastq/WTrep1_SRR2547503_1.fastq ./fastq/WTrep1_trim.fastq ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
~/Documents/bowtie2-2.3.3.1/bowtie2-build ./genome/GCF_000005845.2_ASM584v2_genomic.fna e_coli
~/Documents/tophat-2.1.1.Linux_x86_64/tophat --no-coverage-search -o WTrep1 e_coli ./fastq/WTrep1_trim.fastq
htseq-count -f bam -i gene -t gene WTrep1/accepted_hits.bam ./genome/GCF_000005845.2_ASM584v2_genomic.gff > ./WTrep1.txt
Tophat alignment summary:
Reads:
Input : 10820715
Mapped : 9933765 (91.8% of input)
of these: 9145766 (92.1%) have multiple alignments (675 have >20)
91.8% overall read mapping rate.
Last few lines of htseq-count output (using HTSeq 0.9.1):
__no_feature 760545
__ambiguous 380
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 60525122
As one of the reasons, could this be due to low-complexity sequences in the fastq file?
Absolutely. For RNA-seq, reads derived from rRNA are a likely suspect.
As you may know, tophat only reports the best alignment (this also applies to Bowtie2 and HISAT).
tophat2 manual
So although 92% of the reads aligned multiple times, htseq has no information about where the alignments are located, except for the best one. If I had 100 reads that each aligned 200 times, it should say "100 reads have non-unique alignments" rather than "20000 not_unique alignments".
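The difference between counting reads and counting alignment records can be sketched on a toy example. This is only an illustration under assumptions: the SAM records below are made up, and it assumes the counter tallies every record whose NH tag is greater than 1, which would explain a total larger than the number of reads.

```shell
# Toy SAM records (hypothetical data): column 1 is the read name,
# NH:i:<n> is the number of alignments reported for that read.
sam="$(printf '%s\t0\tchr\t%s\t1\t36M\t*\t0\t0\t*\t*\tNH:i:%s\n' \
  readA 100 1 \
  readB 200 3 \
  readB 900 3 \
  readB 1500 3 \
  readC 300 2 \
  readC 700 2)"

# Count alignment records whose NH tag is greater than 1.
records=$(printf '%s\n' "$sam" | awk '
  { for (i = 12; i <= NF; i++)
      if ($i ~ /^NH:i:/) { split($i, t, ":"); if (t[3] + 0 > 1) n++ } }
  END { print n + 0 }')

# Count distinct read names among those same records.
reads=$(printf '%s\n' "$sam" | awk '
  { for (i = 12; i <= NF; i++)
      if ($i ~ /^NH:i:/) { split($i, t, ":"); if (t[3] + 0 > 1) seen[$1] = 1 } }
  END { n = 0; for (r in seen) n++; print n }')

echo "multi-mapped records: $records"
echo "multi-mapped reads:   $reads"
```

Here two reads account for five records, so a per-record tally is larger than a per-read tally even though nothing is wrong with the input.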
As you may know, tophat reports up to 20 alignments per read by default (the -g/--max-multihits option):
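A quick sanity check on the numbers posted in the question, assuming (not confirmed here) that __alignment_not_unique counts alignment records rather than reads: dividing it by the number of multi-mapped reads from the tophat summary gives the average number of reported alignments per multi-mapped read, which should stay at or below the cap of 20.

```shell
multi_reads=9145766   # "have multiple alignments" from the tophat summary
not_unique=60525122   # __alignment_not_unique from htseq-count
cap=20                # default maximum alignments reported per read

# Average alignments per multi-mapped read, under the assumption above.
avg=$(LC_ALL=C awk -v a="$not_unique" -v r="$multi_reads" \
  'BEGIN { printf "%.2f", a / r }')

# Largest record count that would still be consistent with the cap.
upper=$((multi_reads * cap))

echo "average alignments per multi-mapped read: $avg"
echo "upper bound on records at cap $cap: $upper"
```

The average comes out between 1 and 20, so the reported total is at least consistent with a per-record count.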
Sorry if you felt a bit uncomfortable, I did not mean that. I was just confused.
So if tophat found 25 alignments for one read:
Am I correct? Thanks for your reply and patience.
I agree that it's confusing - don't worry. I didn't mean to be snarky.
Intuitively I would agree with your statements, although I haven't used Tophat in a long time (and you shouldn't use it either), so I'm not too sure. Perhaps someone else can chime in.
I agree that using HISAT may provide higher efficiency, but I don't think there is a fundamental improvement in terms of precision.
Thanks!