Question

Low DE genes in DESeq2 (update PCA plots)

0

Entering edit mode

8.6 years ago

CandiceChuDVM ★ 2.5k

Hi all,

I want to use RNA-seq data to discover DE genes in dogs with a specific hereditary disease versus healthy dogs. I used Illumina Hiseq (paired-end) to sequence 8 samples (rapid=3, slow=3, control=2). Each of them has ~30M reads. However, I only got 15 genes with adjusted p-values were less than 0.1 in DEseq. Could somebody point our what's wrong? Here are my commands and outputs in each step:

Genome and annotation

genome: ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.CanFam3.1.dna.toplevel.fa.gz
annotation: ftp://ftp.ensembl.org/pub/current_gtf/canis_familiaris/Canis_familiaris.CanFam3.1.84.gtf.gz

HISAT2 Commands

hisat2-build -p 20 $GENOME canFam3

python hisat2_extract_splice_sites.py $ANNOT > splicesites.txt

hisat2 -p 20 \
   -q \
   --dta \
   --known-splicesite-infile $WORKDIR/splicesites.txt \
   -x $HST2IDX \
   -1 R1_combined.fastq.gz \
   -2 R2_combined.fastq.gz \
   -S align.sam

HISAT2 Outputs

samtools flagstat align.sam

65458550 + 0 in total (QC-passed reads + QC-failed reads)
6429010 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
62834643 + 0 mapped (95.99%:-nan%)
59029540 + 0 paired in sequencing
29514770 + 0 read1
29514770 + 0 read2
54628232 + 0 properly paired (92.54%:-nan%)
55418058 + 0 with itself and mate mapped
987575 + 0 singletons (1.67%:-nan%)
225944 + 0 with mate mapped to a different chr
200390 + 0 with mate mapped to a different chr (mapQ>=5)

hisat2 log:

29514770 reads; of these:
  29514770 (100.00%) were paired; of these:
2200654 (7.46%) aligned concordantly 0 times
22741847 (77.05%) aligned concordantly exactly 1 time
4572269 (15.49%) aligned concordantly >1 times
----
2200654 pairs aligned concordantly 0 times; of these:
  317519 (14.43%) aligned discordantly 1 time
----
1883135 pairs aligned 0 times concordantly or discordantly; of these:
  3766270 mates make up the pairs; of these:
    2623907 (69.67%) aligned 0 times
    808718 (21.47%) aligned exactly 1 time
    333645 (8.86%) aligned >1 times

95.55% overall alignment rate

Given that I got ~95% alignment rate, I suspect the problem lies in the following steps.

IGV image

Here is an IGV image from a sample. Is it well-aligned?

enter image description here

Samtools Commands

samtools view -S -u align.sam -o  align.bam

samtools sort -n -@ 20 -m 2G align.bam align.namesorted

HTseq Commands

htseq-count -f bam -i gene_id -r name -s no -t exon align.namesorted.bam $ANNOT > count.txt

HTseq Outputs

In count.txt:

(24580 lines of gene_id and counts are omitted here)
__no_feature    11982150
__ambiguous     250644
__too_low_aQual 1603475
__not_aligned   818166
__alignment_not_unique  4799372

DESeq2 Commands

sampleFiles <- grep("count",list.files(directory),value=TRUE) 
condition <- factor(c(rep("control",2), rep("rapid",3), rep("slow",3)))
timepoints <- factor(rep("t1",8))
sampleTable <- data.frame(sampleName = sampleFiles, 
                      fileName = sampleFiles, 
                      condition=condition, timepoints=timepoints)
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, 
                                   directory = directory, 
                                   design= ~condition)
dds <- dds[ rowSums(counts(dds)) > 3, ]
dds$condition <- relevel(dds$condition, ref="control")
dds <- DESeq(dds)
res <- results(dds, contrast=c("condition","rapid","control"))
sum(res$padj < 0.1, na.rm=TRUE)

sum(res$padj < 0.1, na.rm=TRUE) is the step that gave me only got 15 DE genes.

RNA-Seq HTseq HISAT2 DEseq2 Samtools • 3.9k views

ADD COMMENT • link 8.6 years ago by CandiceChuDVM ★ 2.5k

0

Entering edit mode

htseq-count has an option to specify that your reads are sorted by name, -r name Not sure what's the default...

You lose a lot of counts in the __no_feature section o.o

ADD REPLY • link 8.6 years ago by WouterDeCoster 47k

0

Entering edit mode

One reason to use featureCounts. It sorts your files automatically before counting and it is fast.

ADD REPLY • link 8.6 years ago by GenoMax 147k

0

Entering edit mode

I redid the htseq-count with -r name command. It does not change the result :(

ADD REPLY • link 8.6 years ago by CandiceChuDVM ★ 2.5k

0

Entering edit mode

You didn't show how dds was made.
timepoints is pointless
What makes you think the results are wrong?
Have you looked at the alignments in IGV? That's the only way to get a grip on why only a smallish percent of the alignments are high enough quality and on exons.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon,

Sorry that I missed the dda step. I have updated it my post.
timepoints is a prebuild column for future analysis. I only have t1 data now but I am going to analyze t1~t3 in the future. I know it's pointless now since I only have t1 data but I include it to remind myself for future analysis.
The rapid group and slow group are dogs with a hereditary disease so I expect to see a lot of DE genes compare to the control group. Also, I think the depth of my RNA-seq data is deep enough (~30M reads) to give me more than 15 DE genes. I am worried that the number of DE gene would be too small to allow me to do IPA.
I haven't checked IGV yet. Let me check then I will post my results.

I appreciate your suggestions!

ADD REPLY • link 8.6 years ago by CandiceChuDVM ★ 2.5k

0

Entering edit mode

I have uploaded the IGV image. It looks well-aligned to me.

ADD REPLY • link 8.6 years ago by CandiceChuDVM ★ 2.5k

0

Entering edit mode

Presumably you have a high degree of variability.

ADD REPLY • link 8.6 years ago by Devon Ryan 104k

score 3 · Answer 1 · 2016-04-29

3

Entering edit mode

8.6 years ago

Kristin Muench ▴ 640

How noisy is your data? It is possible that differentially expressed genes are hard to detect because of expression variance. Have you tried plotting the first two principal components of your data, or making a dendrogram of principal components? Ideally, it would separate into healthy and hereditary-variant-of-interest groups. If not, is it possible there is a batch effect in your data that you need to account for? e.g., with R package sva, and its ComBat function

ADD COMMENT • link 8.6 years ago by Kristin Muench ▴ 640

0

Entering edit mode

Thanks, Kristin! I have the dendrogram when I use Tuxedo suite to analyze my data: enter image description here However, I haven't tried to do PCA with this new analysis.

ADD REPLY • link 8.6 years ago by CandiceChuDVM ★ 2.5k

1

Entering edit mode

With DESeq2 you could simply use this:

dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = directory,  design= ~condition)
rld <- rlog(dds) # rlog transformation of your data as described in DESeq2 vignette
plotPCA(rld, intgroup="condition")

and obtain a better overview of the variabily across sample.

ADD REPLY • link 8.6 years ago by VHahaut ★ 1.2k

0

Entering edit mode

Thanks! I did the PCA plot and discovered my inter-condition variance is very big. They are much more similar timepoint-wise rather than condition-wise :( I wonder why the results from Cuffdiff and DEseq2 are so different. PCA plot based on condition PCA plot based on timepoints

ADD REPLY • link 8.6 years ago by CandiceChuDVM ★ 2.5k

1

Entering edit mode

I've also seen really big differences between DE genes called by cuffdiff and DE genes called by DESeq2 in my data. I'm not sure why, but in general people put much more faith into the DESeq2 results. I'm guessing the differences could arise because of differences in how cuffdiff and DESeq do the test - for example, cuffdiff normalizes genes by recasting expression as FPKMs, and FPKM values can potentially be skewed by changes in expression of large genes. DESeq has a more subtle way of adjusting expression levels for library size.

P.S. A note on dendrograms - I've heard it's better/more informative to do dendrograms on principal components rather than raw data because PCs are orthogonal/independent, whereas raw data might not be.

ADD REPLY • link 8.6 years ago by Kristin Muench ▴ 640