Question

Counting number of genes using htseq-count tool.

0

Entering edit mode

9.5 years ago

nalandaatmi ▴ 110

I am interested in finding number of reads associated to each human genes based on the gtf file. I used the below code for running htseq-count.

htseq-count -f bam -s yes -i gene_id accepted_hits.bam genes.gtf >> SampleDH78-4_Genes_count.txt

100000 GFF lines processed.
200000 GFF lines processed.

...

869204 GFF lines processed.
Warning: Read HISEQ:137:C6W39ACXX:5:1216:16699:92182 claims to have an aligned mate which could not be found in an adjacent line.
100000 SAM alignment record pairs processed.

...

124300000 SAM alignment record pairs processed.
124400000 SAM alignment record pairs processed.
Warning: 120128572 reads with missing mate encountered.
124460592 SAM alignment pairs processed.

Sample output from htseq-count

HBA1    6323
HBA2    68466
HBB     16677
HBBP1   1
HBD     2
HBE1    0
HBEGF   0
HBG1    840
HBG2    379
HBM     0
HBP1    119
HBQ1    0

Warning: 120128572 reads with missing mate encountered from htseq-count.

% of reads associated to globin=sum of all the above reads for globin genes divided by the total number of reads * 100.

or

% of reads associated to globin=sum of all the above reads for globin genes / 124460592 * 100.

Only half of the reads pairs were processed.

RNA-Seq htseq count rna • 14k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by nalandaatmi ▴ 110

Ram · Answer 1 · 2015-09-28

0

Entering edit mode

9.5 years ago

tiago211287 ★ 1.5k

The default of HTSeq-count is for a stranded specific library, if you run the default with a unstranded library you will lose 50% of your Data. Check this and then, try again.

As in the manual:

Important: The default for strandedness is yes. If your RNA-Seq data has not been made with a strand-specific protocol, this causes half of the reads to be lost. Hence, make sure to set the option --stranded=no unless you have strand-specific data!

Also, if your data is stranded yes, check this post that can maybe explain your warning.

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by tiago211287 ★ 1.5k

0

Entering edit mode

Thanks Tiago. I will check about the strand specific protocol with the in charge person. Will I be able to get strand details (sense and antisense) from htseq-count. In my above command, I selected for strand as yes.

Bam files are generated from Tophat. does it have to do anything with these warnings?

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by nalandaatmi ▴ 110

0

Entering edit mode

After the alignment, did you process the bam files in any way? Merge | Sort?

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by tiago211287 ★ 1.5k

0

Entering edit mode

Bam files generated by tophat are untouched. Just I provided the bam files as input to htseq.

ADD REPLY • link 9.5 years ago by nalandaatmi ▴ 110

0

Entering edit mode

If you used paired end data , bam file should be sorted by name.

samtools sort -n  in.bam out.srt

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by Pinji Lei ▴ 20

0

Entering edit mode

Thanks Leipinji. Yeah I am using paired end data. As you mentioned, I am sorting my accepted_hits.bam file from tophat. After sorting, do you want me to just check HISEQ:137:C6W39ACXX:5:1216:16699:92182 in sorted file?

I have posted another question related to How to interpret the difference among these three options in strandedness from HTSeq-count

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by nalandaatmi ▴ 110

0

Entering edit mode

Yes, samtools sort by name will sort paired end reads forward and reverse together.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by Pinji Lei ▴ 20

0

Entering edit mode

Hello Leipinji, then do you know how to reunite .forward.bam and .reverse.bam, previously mapped, together to process to counting?

Many thanks

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 7.5 years ago by lucilepain • 0

0

Entering edit mode

Hi Leipinji, I came across this post while trying to figure out the problem for myself. Could you please explain why do we need to provide a bam file which is sorted by name in HTseq? Can we provide a coordinate-sorted file and then introduce the argument "-r name" in the HTseq command instead? Is that logically okay? If possible, could you help me resolve an error I am encountering in the htseq stage. I have posted up a question. Thank you so much.

ADD REPLY • link 6.1 years ago by bnayer26 • 0

0

Entering edit mode

It's suggested to use uniquely mapped reads for downstream analysis.

ADD REPLY • link updated 2.6 years ago by Ram 45k • written 9.5 years ago by GouthamAtla 12k

0

Entering edit mode

Thanks Goutham. I have posted another question related to How to interpret the difference among these three options in strandedness from HTSeq-count

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by nalandaatmi ▴ 110

0

Entering edit mode

I think, HTS-seq only counts uniq by default, or not?

ADD REPLY • link 9.5 years ago by tiago211287 ★ 1.5k

0

Entering edit mode

From manual:

skip all reads with alignment quality lower than the given minimum value (default: 10 - Note: the default used to be 0 until version 0.5.4.)

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.5 years ago by GouthamAtla 12k