Dear Team,
I am working on RNASeq data which is generated using Truseq straded HT kit.
I have done removing adapters and aligned reads using STAR 2 pass mode.
I am running feature counts (subread 1.5.1) using -s option three times 0,1 and 2. More reads were mapped to features when I specified -s as "0". But rseqc infer it as reverse stranded. I ran htseq-count (with default parameters) and it is also giving the strand as reverse. Here are the commands which I used:
featureCounts -t exon -s 1 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam sum of counts : 34133722
featureCounts -t exon -s 2 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam sum of counts : 34029447
featureCounts -t exon -s 0 -T 12 -g gene_id -a genes.gtf -o featurecounts.txt sample.starAligned.sortedByCoord.out.bam sum of counts : 63157326
Am I using parameters correctly? And summing up counts on 7th column of featurecounts output.Is this right way to predict strand?
Please let me know your suggestions.
Thanks In Advance Fazulur Rehaman
Hi igor,
Thanks a lot for your quick response.
You are right. Even I am sure that if it is stranded library, we should not get same count on both strands. In htseq-count also I got the numbers like below -s reverse 31842773 -s no 31610810 -s yes 2515562
But, I enquire our lab people who did sequencing and they have used dUTP method which is also mentioned in the link you shared.
Here is the result from rseqc infer_experiment.py on the sample which I performed featureCounts.
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0568 Fraction of reads explained by "1+-,1-+,2++,2--": 0.9432 Fraction of reads explained by other combinations: 0.0000
And I observed preQC as well as in postQC, %GC is also little high (53).
If there is DNA contamination, what steps do I perform. Please help me to resolve this.
Thanks In Advance Fazulur Rehaman
Your htseq-count numbers are very different from your featureCounts. That is odd. They should be very similar (within 1%).
I am not familiar with rseqc, but it should be able to calculate how mapped reads were distributed over genome feature (like CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions). If you have DNA contamination, those metrics would be similar.
Hi igor,
Thanks for your suggestions.
I am not sure why both tools were not giving similar results. I am using defaults. I will have a look into the same again. And sure, I will check rseqc metrics for any DNA contamination.
And also, I observed in featureCounts results, 20-30% of reads were not assigned due to mapped to multiple loci.
Please suggest me to resolve this.
Thanks & Regards Fazulur Rehaman
Hi igor,
As per your suggestion, I ran read_distribution.py on the same bam I am getting the output as below.
=====================================================================
From above stats, I observed I am not getting similar metrics.
And also I ran bam_stat.py to know the spliced reads.
I observed reads map to both '+' and '-" are almost same.
Could you please help me to resolve strand issue.
Thanks In Advance Fazulur Rehaman
Hi vrehaman,
I am in a similar situation like you. I see the issue is with strands setting where featurecounts with
-s 1
and-s 2
gave similar number of reads while I get double the number of reads assigned with-s 0
. With RSeQC, I see most reads fraction (>90%) assigned to strandwith HTSeq I get:
with featurecounts:
Sequencing provider mentioned libraries are 'stranded' prepared with NuGen Universal Plus RNA-seq kit.
I wonder if stranded protocol did not work well in this case and why the counts are very different with HTSeq and featurecounts. Ideally they should be similar with HTSeq
- s reverse
and featurecount-s 2
.Do you know what kit your data was sequenced with and please post if you have figured out anything further on this.
Are you sure you are properly accounting for paired reads? You have about 16M unstranded reads with HTSeq, but 32M with featureCounts. It looks like you ran featureCounts in single-read mode, which would explain twice as many reads and half of them switching strand.
Sound like you do not have a stranded library.