I am interested in finding number of reads associated to each human genes based on the gtf file. I used the below code for running htseq-count.
htseq-count -f bam -s yes -i gene_id accepted_hits.bam genes.gtf >> SampleDH78-4_Genes_count.txt
100000 GFF lines processed.
200000 GFF lines processed.
...
869204 GFF lines processed.
Warning: Read HISEQ:137:C6W39ACXX:5:1216:16699:92182 claims to have an aligned mate which could not be found in an adjacent line.
100000 SAM alignment record pairs processed.
...
124300000 SAM alignment record pairs processed.
124400000 SAM alignment record pairs processed.
Warning: 120128572 reads with missing mate encountered.
124460592 SAM alignment pairs processed.
Sample output from htseq-count
HBA1 6323
HBA2 68466
HBB 16677
HBBP1 1
HBD 2
HBE1 0
HBEGF 0
HBG1 840
HBG2 379
HBM 0
HBP1 119
HBQ1 0
Warning: 120128572 reads with missing mate encountered from htseq-count.
% of reads associated to globin=sum of all the above reads for globin genes divided by the total number of reads * 100.
or
% of reads associated to globin=sum of all the above reads for globin genes / 124460592 * 100.
Only half of the reads pairs were processed.
Thanks Tiago. I will check about the strand specific protocol with the in charge person. Will I be able to get strand details (sense and antisense) from htseq-count. In my above command, I selected for strand as yes.
Bam files are generated from Tophat. does it have to do anything with these warnings?
After the alignment, did you process the bam files in any way? Merge | Sort?
Bam files generated by tophat are untouched. Just I provided the bam files as input to htseq.
If you used paired end data , bam file should be sorted by name.
Thanks Leipinji. Yeah I am using paired end data. As you mentioned, I am sorting my accepted_hits.bam file from tophat. After sorting, do you want me to just check
HISEQ:137:C6W39ACXX:5:1216:16699:92182
in sorted file?I have posted another question related to How to interpret the difference among these three options in strandedness from HTSeq-count
Yes, samtools sort by name will sort paired end reads forward and reverse together.
Hello Leipinji, then do you know how to reunite .forward.bam and .reverse.bam, previously mapped, together to process to counting?
Many thanks
Hi Leipinji, I came across this post while trying to figure out the problem for myself. Could you please explain why do we need to provide a bam file which is sorted by name in HTseq? Can we provide a coordinate-sorted file and then introduce the argument "-r name" in the HTseq command instead? Is that logically okay? If possible, could you help me resolve an error I am encountering in the htseq stage. I have posted up a question. Thank you so much.
It's suggested to use uniquely mapped reads for downstream analysis.
Thanks Goutham. I have posted another question related to How to interpret the difference among these three options in strandedness from HTSeq-count
I think, HTS-seq only counts uniq by default, or not?
From manual: