I have a standard RNA-seq dataset, 3 replicates of treated and control. I have obtained the read counts according to 2 different pipelines, first with STAR and featureCounts (with ENCODE preferred options) and second with Salmon.
First pipeline (STAR -> featureCount -> DEseq2)
STAR \
--runThreadN 4 \
--limitBAMsortRAM 31000000000 \
--genomeLoad LoadAndKeep \
--genomeDir STAR_index \
--readFilesIn R1.fastq.gz R2.fastq.gz \
--readFilesCommand zcat \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outFileNamePrefix results/STAR/control_1/ \
--outSAMtype BAM SortedByCoordinate \
--chimSegmentMin 20
featureCounts \
-p \
-B \
-C \
-T 1 \
-s 2 \
-a Homo_sapiens.GRCh38.94.gtf \
-o control_x.txt \
Aligned.sortedByCoord.out.bam
DEseq2
dds <- DESeqDataSetFromMatrix(countData = countData, #a dataframe of the read counts
colData = colData,
design = ~ conditions, tidy=TRUE)
dds <- DESeq(dds)
deseq.results <- results(dds, contrast=c("conditions", "treated", "control"))
Second pipeline (Salmon -> DEseq2)
salmon quant \
-i samlon_index \
-p 4 \
-l A \
-1 R1.fastq.gz \
-2 R2.fastq.gz \
--validateMappings \
--gcBias \
-o control_x
DEseq2
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~conditions)
dds <- DESeq(dds)
deseq.results <- results(dds, contrast=c("conditions", "treated", "control"))
When I try to correlated the log2FoldChange between but output, I get fairly low correlation (about 0.2). Furthermore, I get 49 significantly DE genes from STAR and 64 from Salmon, of which only 3 are shared. From the literature, I was expecting much better correlations.
Am I doing something wrong in the process? If not, what can explain such wide difference?
Thanks!
EDITS
Of notes, I am not expecting many DEGs with this dataset. This has also been the case with other datasets with a greater number of DEGs.
plotMA(dds.star)
plotMA(dds.salmon)
Hmm, in any case you almost have no DEGs, I'd call this consistent. Can you show MA-plots of both pipelines?
plotMA
function.Please see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here)
I have some questions about your usage of some STAR parameters.
Why do you allow a read to map at 20 positions with the parameter
--outFilterMultimapNmax 20
? Feature counts only counts uniquely mapped reads anyway.The parameters
-alignSJoverhangMin
and--alignSJDBoverhangMin
will not affect the resulting bam file if you don't state the parameter--outFilterType BySJout
. Also--alignSJDBoverhangMin 1
is very low. A read only needs one nucleotide right at one side of the annotated junction to be aligned to it.Are you sure, you want to allow chimeric alignment with the
--chimSegmentMin 20
parameter for DGE? This will incluse reads mapping to 2 chromosomes.Quite frankly, I used these parameters by default now. I have spent some time over different projects with various parameters and when I used these, which come from the ENCODE protocols, I systematically get higher uniquely mapped reads. I tend to simply move on when they give me uniquely mapped reads above 0.9. Filter like
--outFilterMultimapNmax
are kept as I used the BAM file for other pipelines in parallel of featureCount.--outFilterType BySJout
is simply missing from the cut and paste, sorry for the confusion.