Question

HTSeq funny output

1

Entering edit mode

7.1 years ago

--panda-- ▴ 30

I got some interesting output from the htseq-count.

 My alignment file have 9933765 reads

 My htseq.count have 60525122 none unique alignment

I wonder if someone can have a reasonable explanation that "none unique alignments" are more than my input reads. If not, how can you persuade yourself when you use htseq( eg. as an input for edgeR or DESeq2)?

I put the pipline, tophat summary as well as the htseq-count output below.

Command Line:

  ~/Documents/sratoolkit.2.7.0-ubuntu64/bin/fastq-dump -I --split-files -O ./fastq WTrep1_SRR2547503.sra

  java -jar ~/Documents/Trimmomatic-0.36/trimmomatic-0.36.jar SE -phred33 ./fastq/WTrep1_SRR2547503_1.fastq ./fastq/WTrep1_trim.fastq   ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

  ~/Documents/bowtie2-2.3.3.1/bowtie2-build ./genome/GCF_000005845.2_ASM584v2_genomic.fna e_coli

  ~/Documents/tophat-2.1.1.Linux_x86_64/tophat --no-coverage-search -o WTrep1 e_coli  ./fastq/WTrep1_trim.fastq

  htseq-count -f bam -i gene -t gene  WTrep1/accepted_hits.bam  ./genome/GCF_000005845.2_ASM584v2_genomic.gff > ./WTrep1.txt

Tophat alignment summary:

Reads:
      Input     :  10820715
      Mapped   :   9933765 (91.8% of input)
      of these:   9145766 (92.1%) have multiple alignments (675 have >20)
 91.8% overall read mapping rate.

Last few line of htseq-count output (using HTSeq (0.9.1)):

      __no_feature  760545
      __ambiguous   380
      __too_low_aQual   0
      __not_aligned 0
      __alignment_not_unique    60525122

RNA-Seq quantification htseq • 2.4k views

ADD COMMENT • link 7.1 years ago by --panda-- ▴ 30

score 0 · Answer 1 · 2017-11-03

0

Entering edit mode

7.1 years ago

WouterDeCoster 47k

Your fastq has 10820715 reads, of which 9933765 were mapped by tophat (see also C: Tophat: Can I forbid splice junction? )

BUT 92% of your reads were aligned multiple times, resulting in 60525122 alignments which are not unique -> multimapping reads.

Nothing wrong here.

ADD COMMENT • link 7.1 years ago by WouterDeCoster 47k

0

Entering edit mode

as one of the reasons, could be due to low complexity sequences in fastq file ?

ADD REPLY • link 7.1 years ago by Sam ▴ 150

0

Entering edit mode

Absolutely, for RNA-seq reads derived from rRNA is a likely suspect.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

Entering edit mode

As you may know, tophat only report the best alignment(it also applies to Bowtie2 or HISAT).

tophat2 manual

--report-secondary-alignments
By default TopHat reports best or primary alignments based on alignment scores (AS). Use this option if you want to output additional or secondary alignments (up to 20 alignments will be reported this way, this limit can be changed by using the -g/--max-multihits option above).

So although 92% reads aligned multiple times, htseq has no information about where are the alignments locate except for the best one. If I got 100 reads aligned 200 times, it should say "100 reads has none unique alignments" rather than "20000 not_unique alignments"

ADD REPLY • link 7.1 years ago by --panda-- ▴ 30

0

Entering edit mode

As you may know, tophat reports up to 20 alignments per read by default:

-g/--max-multihits <int> Instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Unless you use --report-secondary-alignments, TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments. In case of using --report-secondary-alignments, TopHat will try to report alignments up to this option value, and TopHat may randomly output some of the alignments with the same score to meet this number.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

Entering edit mode

Sorry if you felt a bit uncomfortable, I did not mean that. I was just confused.

So if tophat found 25 alignments for one read:

 situation1:   1 has highest score(let's say 0) and the other 24 has lower score(let's say -10), only the one alignment is reported. 

 situation2:   10 has the same highest score(0) and the other 14 has lower score(-10), the first 10 alignments are reported. 

 situation3:   25 has the same highest score(0), random choose 20 out of the 25 alignment and report them.

 situation4:   if  use --report-secondary-alignments and 10 alignments has the same highest score(0) and the other 14 has lower score(-10),  the 10 highest score alignments are reported and 10 out of the 14 lower score reads are reported.

Am I correct? Thanks for you reply and patience

ADD REPLY • link 7.1 years ago by --panda-- ▴ 30

0

Entering edit mode

I agree that it's confusing - don't worry. I didn't mean to be snarky.

Intuitively I would agree with your statements - although I haven't used Tophat in a long time (and you shouldn't use it too), so I'm not too sure. Perhaps someone else can chime in.

ADD REPLY • link 7.1 years ago by WouterDeCoster 47k

0

Entering edit mode

I agree that using HISAT may provide higher efficiency but I think there is no fundamental improvement in terms of precision.

Thanks!

ADD REPLY • link 7.1 years ago by --panda-- ▴ 30