I am using htseq-count on BAM files from a bacterial species. We are comparing WT strains as well as two strains with genes knocked out. The knockouts have been verified with whole genome sequencing, but do retain the first 15 and last 18 bp of the ORF. I used paired-end Illumina data to align. When using htseq-count, it records counts for these deleted genes. One gene has counts of ~95 (compared to ~6000 for WT) while the other gene has counts of ~28 (compared to ~160 for WT).
Viewing the alignments with BigWig files I see the majority of the gene is lacking all coverage, but the ends of the genes seem have reads that overlap slightly with the gene. I believe they are overlapping with the small amount of the ORF remaining and down/upstream sequence still present in the deletion strains. Previously, when using single-end data, htseq-count did not record any reads for the same deletion strains.
Because of the sequencing analysis and the RNA-seq alignment analysis, I know these genes are not present in the strains, and so the count results are misleading. We can manually adjust the count data to 0 but I am curious to know why we are seeing these count values for deleted genes.
Does anyone have a suggestion as to why we are seeing misleading gene counts with htseq-count for these two deleted genes using paired-end RNA-seq data? It seems misleading and any help on how to fix it would be appreciated.
Command used: htseq-count -f bam -s yes <bamfile> <gff>
HTSeq
- version 2.0.3; Python
- version 3.10.12; operating system - CentOS 7
; PE reads aligned with BWA
version 0.7.17-r1188
Isn't that why?
Yeah, I think that's the case. I was just wondering if this was something other people had seen or if there is a setting or something I need to do to address it.
You have confirmed that the part you removed is not present in any reads?
Way ht-seq counts is shown in the illustration here: https://htseq.readthedocs.io/en/release_0.11.1/count.html
Thanks for the soft-clipping suggestion. Yes, it shows that all but 10 bases at the 5' end and 10 bases at the 3' end of the ORF are missing in the alignment.