Question

High number of no feature while using htseq-count on Tophat2 aligned as well as STAR aligned paired-end data

0

Entering edit mode

8.5 years ago

candida.vaz ▴ 40

Dear all,

I have paired-end (150bp) data for the chicken Ileum from Illumina TruSeq. I was told that it is first stranded. I removed the adapters through Trimmomatic tool. I then aligned it to the Chicken Genome (Ensembl) galGal4 using STAR aligner as well as Tophat2.

I wanted to quantify the reads using htseq-count, so I sorted the bam file by name.

I used the following parameters for htseq-count :

htseq-count --format='bam' --order='name' --stranded='yes' --minaqual='10' --type='exon' --idattr='gene_id' --samout='S1-RCI001-top-out.sam' S1-RCI001-tophat.sorted.bam genes.gtf > S1-RCI001-tophat-htseq-res.txt

Total number of SAM alignments : 30059627 (when htseq finishes running) On using grep "XF:" on S1-RCI001-top-out.sam I get 56541081 (Not sure which one to use for comparison) Now, the problem is, that I am getting an exceeding amount of "no_features" and maximum of the gene counts are zero

__no_feature    26618544
__ambiguous 1751
__too_low_aQual 0
__not_aligned   0
__alignment_not_unique   3199738

Please help me identify the problem. As far as I know, I used the correct GTF file, sorted the bam file by name, and I am sure that its first stranded so used stranded="yes".

Thanks for your help!

Candida Vaz

htseq-count RNA-Seq next-gen • 5.5k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 8.5 years ago by candida.vaz ▴ 40

0

Entering edit mode

Can you post the output of the following commands?

samtools view -H S1-RCI001-tophat.sorted.bam
cut -f 1 genes.gtf | sort | uniq

My guess is that you have a mismatch in chromosome names. If the outputs are long, then just post 10 or so lines of each.

ADD REPLY • link 8.5 years ago by Devon Ryan 104k

0

Entering edit mode

Dear Devon,

Here is the output from samtools view:

@HD VN:1.0  SO:queryname
@SQ SN:1            LN:195276750
@SQ SN:10   LN:19911089
@SQ SN:11   LN:19401079
@SQ SN:12   LN:19897011
@SQ SN:13   LN:17760035
@SQ SN:14   LN:15161805
@SQ SN:15   LN:12656803
@SQ SN:16   LN:535270
@SQ SN:17   LN:10454150
@SQ SN:18   LN:11219875

And here is the output from the genes.gtf file 1st column (just top ten lines)

ADD REPLY • link updated 8.5 years ago by Devon Ryan 104k • written 8.5 years ago by candida.vaz ▴ 40

0

Entering edit mode

OK, then michael.ante's suggestion is likely the correct one. Have a look at the sorted BAM file in IGV. That should give you a better feel for the sort of numbers that htseq-count should be producing.

ADD REPLY • link 8.5 years ago by Devon Ryan 104k

score 1 · Answer 1 · 2016-06-15

1

Entering edit mode

8.5 years ago

michael.ante ★ 3.9k

Hi Candida,

For TruSeq stranded libraries, use the Htseq-count option 'reverse'. Reads R1 are reflecting the cDNA orientation while the R2 reads are in orientation of the mRNA. If you want to be sure, use RSeQC's infer_experiment.py tool.

Cheers,

Michael

ADD COMMENT • link 8.5 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Hi Michael,

As you mentioned that for TruSeq stranded libraries, one must use the HTseq-count "reverse" option. I was wondering, what the tophat stranded option should be, I have used "fr-firststrand" while mapping with tophat2 as told by the sequence provider. So mapping with tophat2 "firststrand" option and htseq-count "reverse" is fine right? As for the STAR aligner, I guess there is no option to mention strandedness. Thanks, Candida

ADD REPLY • link 8.5 years ago by candida.vaz ▴ 40

0

Entering edit mode

Hi Candida,

I have to look that up each time. This blog explains it very good.

Cheers,

Michael

ADD REPLY • link 8.4 years ago by michael.ante ★ 3.9k

score 1 · Answer 2 · 2016-06-16

1

Entering edit mode

8.5 years ago

michael.ante ★ 3.9k

Hi Candida,

in all fields the read count reflects the number of uniquely mapping read - except the "alignment_not_unique" there each alignment is counted (if a read is mapped 5 times, this number will be increased by 5). Thus in order to compute statistics, you may include everything but the last line. For further information, please have a look at HTSeq-count's FAQ section.

Cheers,

Michael

ADD COMMENT • link 8.5 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Hi Michael,

Thanks for the information. I understand that the "alignment_not_unique" should be exempted for computing the statistics.

But to compute the statistics, as for the "yes" option results, to get the percentage of "no_feature" reads should I divide 26618544 by 30059627 (number of SAM alignment pairs processed) or by some other number? If I use 30059627, I get 88.5% no features Similarly, for the "reverse" option results, to get the percentage of "no_feature" reads if I divide 12272882 by 30059627 (number of SAM alignment pairs processed) I still get 41% "no_features".

My next question is how much percentage of "no_features" is acceptable?

Thanks for all your help, Much appreciated. Candida

ADD REPLY • link 8.5 years ago by candida.vaz ▴ 40

1

Entering edit mode

The no_feature sums up all those reads that are intronic and intergenic - according to your annotation. If you have novel exons, 5' and 3' UTR extension, or novel transcripts, these will contribute to the no_feature. On the other hand, if you have genetic contamination or a lot of retained introns in your sample, these will also contribute.

Thus, you can use the annotated SAM file from HTSeq-count, grep the no_feature reads and analyse them with e.g. the IGV. If you have sufficient clues that most of the reads are novel exons/transcripts etc., you may run cufflinks an your alignments to extend the given annotation (it's called RABT or so).

ADD REPLY • link 8.5 years ago by michael.ante ★ 3.9k

score 0 · Answer 3 · 2016-06-16

Hi Michael,

I ran htseq-count using the 'reverse' option and got these results (better than "yes" option): 30059627 SAM alignment pairs processed. __no_feature 12272882 __ambiguous 65301 __too_low_aQual 0 __not_aligned 0 __alignment_not_unique 3199738

To get the percentage of "no-feature" and "alignment-not-unique", should I divide these numbers with this 30059627 (get this number after htseq-count finishes) or with grep "XF:" on the "samout" file 56541081

Best, Candida