Question

HTSeq-count high number of no_feature

1

Entering edit mode

10.5 years ago

cafelumiere12 ▴ 80

Hi all,

I'm trying to figure out if there's anything wrong with my htseq-count result. I started with a sam file converted from a name-sorted bam file, then use the following options:

python -m HTSeq.scripts.count -m intersection-strict -s no -r name $nsorted_aligned_sam $refGtf

Htseq-count ran fine and showed that there were:

32268229 SAM alignment pairs processed.

When I looked into the count result:

__no_feature    11327775
__ambiguous     7618389
__too_low_aQual 0
__not_aligned   844654
__alignment_not_unique  4974959

Is this too high of a number of no_feature? I checked that the gtf file has the same chr formatting as the sam file, and that the stranded option is correct. Am I missing something here?

Any idea is appreciated. Thanks in advance!

htseq-count rna-seq • 16k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by cafelumiere12 ▴ 80

0

Entering edit mode

did you check whether your sam file is sorted ?

ADD REPLY • link 10.5 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

Yes - I name sorted the bam file then converted to sam

ADD REPLY • link 10.5 years ago by cafelumiere12 ▴ 80

0

Entering edit mode

what you get if try with method=union

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Manvendra Singh ★ 2.2k

0

Entering edit mode

That's not normal, you're only able to get counts from around a quarter of the alignments. Have a look at the BAM file in IGV. See if you have lots of alignments to intergenic regions and then look at those regions in the UCSC genome browser. It could be that you just have a large amount of tRNA or repeats being expressed. Alternatively, perhaps you just have a bunch of DNA contamination. In any case, looking at the actual alignments would be a good place to start.

ADD REPLY • link 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

Got it. Thanks a lot!

ADD REPLY • link 10.5 years ago by cafelumiere12 ▴ 80

0

Entering edit mode

So I re-ran HTSeq-count with the following modifications: (1) I used another gtf file (2) -m union. Now I'm getting about 53% of the alignments mapped to feature:

__no_feature    8716076
__ambiguous     534206
__too_low_aQual 0
__not_aligned   844654
__alignment_not_unique  4974959

Is this more in the "acceptable " range?

A quick look at the counting result did show a lot less zero counts.

I still have to look at the bam file in IGV - for some reason it's not showing me anything on IGV- probably something really simple I have to change on the genome browser.

Thanks!

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by cafelumiere12 ▴ 80

0

Entering edit mode

That's certainly better. Realistically, the percentage of alignments that are to features will depend on the experiment and organism. If you're not using something like mouse/human/rat then the annotation may not be that great, so you'd expect a considerable number of __no_feature alignments. Given the large change in numbers just by using a different GTF, that may be the case here. In any case, it's always good to have a quick look at things in IGV anyway, just to ensure that there's nothing obvious you missed.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks a lot! So I checked the data on IGV. I am not sure if there is any suggested efficient way of checking the alignments. I did some point checking with the counts file I have and it seems to make sense.

The other thing that made me feel better was that I also got an alignment summary which calculates the percentage of reads mapped to intergenic and coding region. This set of samples have an average of about 30% reads mapped to intergenic region.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by cafelumiere12 ▴ 80