Hi all,
I want to use RNA-seq data to discover DE genes in dogs with a specific hereditary disease versus healthy dogs. I used Illumina Hiseq (paired-end) to sequence 8 samples (rapid=3, slow=3, control=2). Each of them has ~30M reads. However, I only got 15 genes
with adjusted p-values were less than 0.1 in DEseq. Could somebody point our what's wrong? Here are my commands and outputs in each step:
Genome and annotation
genome: ftp://ftp.ensembl.org/pub/current_fasta/canis_familiaris/dna/Canis_familiaris.CanFam3.1.dna.toplevel.fa.gz
annotation: ftp://ftp.ensembl.org/pub/current_gtf/canis_familiaris/Canis_familiaris.CanFam3.1.84.gtf.gz
HISAT2 Commands
hisat2-build -p 20 $GENOME canFam3
python hisat2_extract_splice_sites.py $ANNOT > splicesites.txt
hisat2 -p 20 \
-q \
--dta \
--known-splicesite-infile $WORKDIR/splicesites.txt \
-x $HST2IDX \
-1 R1_combined.fastq.gz \
-2 R2_combined.fastq.gz \
-S align.sam
HISAT2 Outputs
samtools flagstat align.sam
65458550 + 0 in total (QC-passed reads + QC-failed reads)
6429010 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
62834643 + 0 mapped (95.99%:-nan%)
59029540 + 0 paired in sequencing
29514770 + 0 read1
29514770 + 0 read2
54628232 + 0 properly paired (92.54%:-nan%)
55418058 + 0 with itself and mate mapped
987575 + 0 singletons (1.67%:-nan%)
225944 + 0 with mate mapped to a different chr
200390 + 0 with mate mapped to a different chr (mapQ>=5)
hisat2 log:
29514770 reads; of these:
29514770 (100.00%) were paired; of these:
2200654 (7.46%) aligned concordantly 0 times
22741847 (77.05%) aligned concordantly exactly 1 time
4572269 (15.49%) aligned concordantly >1 times
----
2200654 pairs aligned concordantly 0 times; of these:
317519 (14.43%) aligned discordantly 1 time
----
1883135 pairs aligned 0 times concordantly or discordantly; of these:
3766270 mates make up the pairs; of these:
2623907 (69.67%) aligned 0 times
808718 (21.47%) aligned exactly 1 time
333645 (8.86%) aligned >1 times
95.55% overall alignment rate
Given that I got ~95% alignment rate, I suspect the problem lies in the following steps.
IGV image
Here is an IGV image from a sample. Is it well-aligned?
Samtools Commands
samtools view -S -u align.sam -o align.bam
samtools sort -n -@ 20 -m 2G align.bam align.namesorted
HTseq Commands
htseq-count -f bam -i gene_id -r name -s no -t exon align.namesorted.bam $ANNOT > count.txt
HTseq Outputs
In count.txt:
(24580 lines of gene_id and counts are omitted here)
__no_feature 11982150
__ambiguous 250644
__too_low_aQual 1603475
__not_aligned 818166
__alignment_not_unique 4799372
DESeq2 Commands
sampleFiles <- grep("count",list.files(directory),value=TRUE)
condition <- factor(c(rep("control",2), rep("rapid",3), rep("slow",3)))
timepoints <- factor(rep("t1",8))
sampleTable <- data.frame(sampleName = sampleFiles,
fileName = sampleFiles,
condition=condition, timepoints=timepoints)
dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = directory,
design= ~condition)
dds <- dds[ rowSums(counts(dds)) > 3, ]
dds$condition <- relevel(dds$condition, ref="control")
dds <- DESeq(dds)
res <- results(dds, contrast=c("condition","rapid","control"))
sum(res$padj < 0.1, na.rm=TRUE)
sum(res$padj < 0.1, na.rm=TRUE)
is the step that gave me only got 15 DE genes.
htseq-count has an option to specify that your reads are sorted by name, -r name Not sure what's the default...
You lose a lot of counts in the __no_feature section o.o
One reason to use featureCounts. It sorts your files automatically before counting and it is fast.
I redid the
htseq-count
with-r name
command. It does not change the result :(dds
was made.timepoints
is pointlessHi Devon,
dda
step. I have updated it my post.timepoints
is a prebuild column for future analysis. I only havet1
data now but I am going to analyzet1~t3
in the future. I know it's pointless now since I only havet1
data but I include it to remind myself for future analysis.I appreciate your suggestions!
I have uploaded the IGV image. It looks well-aligned to me.
Presumably you have a high degree of variability.