I have some single-cell RNA-seq data that I've processed via a tophat2
--> htseq-count
pipeline, and I've also generated some bed graphs via bedtools genomecov
. I have stumbled upon two issues that I don't seem to understand.
Issue 1
I've noticed that the htseq-count
and bedgraphs for this data is not necessarily in agreement. For example, we see peaks on IGV with cells that have 0 counts according to htseq. An example of this can be seen in cells H11 and G11. According to htseq-count
, H11 has 0 hits on Xist while G11 has 320. If we look at the IGV tracks for these cells we see peaks on both tracks despite the 0 in H11 (please see attached screenshot below; tracks are auto-scaled). My reasoning for this is because htseq-count does not account for multi mapped reads and these may be the read showing up on IGV. Also the bed graphs are not normalized/scaled. Does anyone have other insight? In my pipeline I've treated all cells as non-strand-specific (tophat2 argument --library-type fr-unstranded
and htseq-count argument --stranded=no
), which I believe is correct given the library prep conditions. The pipeline is briefed below. I should mention that no errors or warning were thrown thought the pipeline. Please let me know if anything strikes you as incorrect.
Issue 2
So in my RNA-seq data sometimes I see reads that do not correspond to an annotated gene/region, so let's say these are "New genes".
Would I see such regions in other published data, or did they somehow only provide coordinates for regions that are annotated? Because, just looking in IGV, I see no such regions in other published data, so I'm wondering if it could be because of the way they deposited their data.
Another thought is that these 'unannotated' regions are actually annotated but using USCS or other gene tracks.
Please refer to the screen shot posted below.