Hi all,
To make it simple. I tried both htseq and featurecounts to generate read counts for differential expressed gene analyisis using deseq2. Here are example outputs for one of my samples:
- Htseq (input bam file sorted by name using sortmeRNA)
code:
htseq-count -f bam -i Name ${i}_final_byname.bam B_anynana_v2.gff > ${i}_count_byname.txt
.
result (first 10 rows):
BANY.1.2.t00001 74
BANY.1.2.t00002 0
BANY.1.2.t00003 4
BANY.1.2.t00004 1
BANY.1.2.t00005 0
BANY.1.2.t00007 4
BANY.1.2.t00009 25
BANY.1.2.t00010 302
BANY.1.2.t00011 2
BANY.1.2.t00012 32
.
summary:
__no_feature 22888655
__ambiguous 31338
__too_low_aQual 2391179
__not_aligned 3147296
__alignment_not_unique 739779
- featurecounts (input bam file sorted by position using sortmeRNA)
code:
featureCounts -T 12 -p -g Name -a B_anynana_v2.gff -o ${i}_featurecounts_position.txt ${i}_final.bam
.
result (first 10 rows, only gene names and counts are shown here)
BANY.1.2.t00001 846
BANY.1.2.t00002 0
BANY.1.2.t00003 65
BANY.1.2.t00004 3
BANY.1.2.t00005 0
BANY.1.2.t00007 185
BANY.1.2.t00009 446
BANY.1.2.t00010 2459
BANY.1.2.t00011 31
BANY.1.2.t00012 568
.
summary:
Assigned 18226666
Unassigned_Unmapped 3147296
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 1971565
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 8481719
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 130189
As I can see they both have the same number of reads that are not aligned so I assume both went fine. However, I still have some questions about this...
- Why feature counts gave me significant higher counts for each gene, is it because of the multi-threaded method I used?
- During DE gene analysis using deseq2, I got a lot of genes being filtered out due to the relatively low counts in htseq counted files. And eventually, I got totally different lists of differential expressed genes for htseq and featurecount...which result should I use and trust?
I'm new to RNA-seq analysis. Valuable suggestions are highly appreciated.
can you try changing parameters for
s
option with htseq ? Default fors
(strandedness) is yes. @ 2822462298See Comparison Htseq And Feature Count. The differences you report are surprisingly large and suggest a difference in the basic settings.
The featureCount results will be unchanged by the number of threads used, so multi-threading is not an issue.
I have read that post but I still did not find an explanation for the big gap between the two programs, I know it has something to do with my basic setting but I really don't know what's wrong with it since I did read the manuals very carefully...
Just to be sure, please show the output of
samtools view -H your.bam | head -n 1)
. This should confirm if indeed name-sorted.Just checked for all my unsorted, name-sorted, and position-sorted files, it shows
respectively. I guess there is nothing wrong with the sorting.
Any ideas guys? :(((
Biostars, like most open science forums, is powered by volunteers who offer their time and expertise by volition and not by obligation. So, we cannot guarantee you an answer within a time period. We respond quickly, especially to questions posted on weekdays in the day.
Please be patient.
Hi genomax, I understand and I really appreciate all the suggestions provided. I apologize if I may sound a little bit impatient.
This looks odd indeed. Could you try to run featurecounts again with the -B option to include only complete pairs? Pragmatically, I would still say, go with featureCounts then if it works for you.
Thanks Michael, I tried including the -B option but the counts only reduced a little, they are still much higher than htseq.