I first analysed the data without taking the strand-specificity of my library into account. Afterwards I found out that the library was stranded and reanalysed the data. The difference in counts between the two analyses was very large. Without specifying strandedness I got around 1000-2000 counts per gene I am interested in; however, after specifying strandedness in hisat2 and htseq I got only 20-30 counts for some of the genes I am interested in. The sequencing aimed at a coverage of 30 M. I just wanted to check whether a difference of this size in the counts is expected.
If the dataset is unstranded and you do antisense gene counts and sense gene counts, your values should be about 50% antisense and 50% sense.
If the dataset is stranded (assuming RNA-seq) and you do antisense gene counts and sense gene counts, your values should be >90% antisense and <10% sense. I usually observe an order of magnitude difference in read counts between the two strands when investigating a stranded library.
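To make the 50/50 versus 90/10 rule of thumb concrete, here is a minimal Python sketch. The function names, the example counts, and the 0.1 tolerance are all my own assumptions for illustration, not published thresholds or part of any tool.

```python
def strand_split(sense_counts, antisense_counts):
    """Return the fraction of counted reads that fall on the sense strand."""
    total = sense_counts + antisense_counts
    if total == 0:
        raise ValueError("no reads counted on either strand")
    return sense_counts / total


def describe_library(sense_fraction, tolerance=0.1):
    """Label a library from its sense fraction.

    The tolerance is a hypothetical cut-off chosen for illustration only.
    """
    if sense_fraction >= 1 - tolerance:
        return "stranded, reads mostly sense"
    if sense_fraction <= tolerance:
        return "stranded, reads mostly antisense"
    if abs(sense_fraction - 0.5) <= tolerance:
        return "unstranded"
    return "ambiguous"


# Hypothetical counts for one gene, counted once per strand setting.
fraction = strand_split(sense_counts=25, antisense_counts=1500)
print(round(fraction, 3), describe_library(fraction))
```

With counts like these, nearly all reads sit on one strand, which is the order-of-magnitude gap you would expect from a stranded library. A gene with roughly equal counts in both directions would instead point to an unstranded one.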
Thanks for the answer! I am fairly new to bioinformatics. Could you please point me to a paper that confirms this? I understand why it should be 50/50 for an unstranded library, but the 90/10 proportion for a stranded RNA-seq library is not clear to me yet.
With no knowledge of strandedness ( I got around 1000-2000 counts per
gene I am interested in), however specifying strandedness in hisat2 and
htseq I got only 20-30 counts per some genes that I am interested in.
That suggests to me that you put in the wrong strandedness. Put it in the other direction, and you should get your thousand counts back.
After checking the strandedness with RSeQC I got the following stats:
This is PairEnd Data
Fraction of reads failed to determine: 0.0569
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0170
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9261
Can I conclude from these stats that it is a reverse-stranded library?
Yes. If it is RNA-seq, it is usually reverse stranded. To check this, just change the strand setting in your hisat2 command, then re-run hisat2 and htseq; your 20-30 counts should jump to 900-1800+.
You can also view the BAM files in IGV to verify strandedness.
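If it helps, here is a rough Python sketch of how the RSeQC report above could be read and turned into strand settings. The function names and the 0.8 cut-off are my own assumptions. The options it suggests (`--rna-strandness` for hisat2, `-s` for htseq-count) are the usual ones, but please check them against the manuals for your installed versions.

```python
import re

# The report text quoted in this thread.
REPORT = """\
This is PairEnd Data
Fraction of reads failed to determine: 0.0569
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0170
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9261
"""


def parse_report(text):
    """Return (paired, forward_fraction, reverse_fraction) from the report."""
    paired = "PairEnd" in text
    forward = reverse = None
    for line in text.splitlines():
        match = re.search(r'explained by "([^"]+)":\s*([0-9.]+)', line)
        if not match:
            continue
        pattern, fraction = match.group(1), float(match.group(2))
        if pattern.startswith(("1++", "++")):
            forward = fraction  # read 1 on the same strand as the gene
        elif pattern.startswith(("1+-", "+-")):
            reverse = fraction  # read 1 on the opposite strand
    return paired, forward, reverse


def suggest_settings(paired, forward, reverse, cutoff=0.8):
    """Map the fractions to strand options; cutoff is an illustrative guess."""
    if reverse is not None and reverse >= cutoff:
        return {"library": "reverse",
                "hisat2": "--rna-strandness " + ("RF" if paired else "R"),
                "htseq": "-s reverse"}
    if forward is not None and forward >= cutoff:
        return {"library": "forward",
                "hisat2": "--rna-strandness " + ("FR" if paired else "F"),
                "htseq": "-s yes"}
    return {"library": "unstranded or unclear", "hisat2": "", "htseq": "-s no"}


print(suggest_settings(*parse_report(REPORT)))
```

For the stats posted here, the second pattern dominates at 0.9261, so the sketch reports a reverse-stranded paired-end library.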