Hello People,
I am analyzing 20 samples RNA-Seq data of leukemia. To this end, I have used the Hisat2-StingTie2 pipeline to analyze the samples and arrive at FPKM counts for each gene. After I have run the complete process, I found that the FPKM values are way too low for calculating Differential gene expression and it seems there is no difference in expression across samples.
B85 B86 B78 B89
Minimum 0 0 0 0
First Quartile 0.004382 0.004093 0.019453 0.011406
Median 0.150255 0.153044 0.167367 0.144218
Third Quartile 1.41 1.463101 1.180386 1.064904
Maximum 120290.64 144817.40 116797.73 119615.86
Skewness 91.670 97.440 86.229 93.291
These FPKM values were taken from StringTie file using -A options. These samples were processed using Illumina truseq stranded total mrna kit alongside ribo-depletion kit and was sequenced on NextSeq 550 with 2X75bp reads (avg 60mil reads per sample). Peculiar thing is Hisat2 have shown >90% mapping and there isn't even rRNA contamination. And on top, I have even mapped the reads to cDNA database which also showed >55% reads mapping.
The commands were as follows:
hisat2 -p 60 --dta -x hsa_GRCh38_index -1 sample_R1_pair.fastq.gz -2 sample_R2_pair.fastq.gz -S sample_map2genom.sam && samtools view -bSh -@ 45 sample_map2genom.sam | samtools sort -@ 45 > sample_map2genom.bam
stringtie ${i}_map2genom.bam -G Homo_sapiens.GRCh38.95.gtf -o sample_assembly_strg.gtf -p 60 -A ${i}_abundance.txt -B
This pipeline has worked great for me before but this particular output is totally puzzling me. Has anyone encountered this before ?? Please advice.
Thanks and regards
Hi Shawn,
These are WBCs (leukemia) extracted from FFPE samples but rna quality was good.
For number of genes >0 is ~35k , >1 is ~15-18k and >10 is ~3-5k
"I'm wondering if annotated but unexpressed genes are hiding the true signal"...I didn't exactly get that....
Poor wording on my part. I think your data looks reasonable, 15-18k with >1FPKM is in the realm of what I normally see. I think the fact that you have ~75% of genes with <1 FPKM is because most of them just aren't expressed. I would set an expression threshold and then analyze everything above that threshold in your differential expression.
I normally generate a density plot in R, it'll often show a bimodal curve, then I determine my cutoff for expression empirically.
Actually the problem is I don't see a "CURVE", its just a really long, sharp PEAK at ~0 FPKM value....it looks I might as well take a cut-off of FPKM value of 2 and still would retain more than 25K genes.
Another thing is, which FPKM do you consider ? The one that is given by -A option or the one that comes out from t_data.ctab....
Try making a density plot of the log10(FPKM), that should show a more appropriate curve. And I'm not sure which output to recommend, I haven't used HiSat before, I tend to stick to STAR/HTSeq/DESeq2 or more recently Salmon/DESeq2.
All RNA-seq count data is shifted toward zero, so, that is expected. RNA-seq counts follow a negative binomial distribution. I don't believe that production of FPKM expression values results in an overall change from this data distribution. A transformation of normalised counts via, e.g., variance-stabilisation, logging, Z-scaling, or a combination of these, will alter the distribution.
Also keep in mind that FPKM units are not suitable for differential expression comparisons.
Hi Kevin,
Thanks for reassurance and the information. As for the FPKM units, so which units can reliably be used for differential expression comparisons ?? TPM ??
Hey, if you just follow the recommendation by shawn (Salmon --> DESeq2), then you'll have data that is suitable for differential expression. Salmon will require input FASTQ files, I believe, and bypasses the production of a BAM file.
The other method, STAR --> HTSeq --> DESeq2, will produce a BAM file, with raw counts then derived by HTSeq.
Note, that normalisation and differential expression analysis occurs in DESeq2.
As you have already used StringTie, you can actually use a function, prepDE.py, which conveniently converts your HISAT2 / StringTie dataset to raw counts that are suitable for DESeq2: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq
Criticism of FPKM for differential expression analysis is out there - just do a search. You will not find many voices actively promoting it for such analyses (although publications do exist).