I've been performing RNA-seq according to https://bioinformatics.uconn.edu/rnaseq-arabidopsis/ , following the hisat2-stringtie-ballgown pipeline, but it gave very high p-value and q-value, I've tried many different parameters, but it doesn't change. I am not sure if there are anything wrong in my pipeline.
I build the hisat_index by
pythotract_splice_sites.py /mnt/chaelab/donghui/TAIR10/gtf/Arabidopsis_thaliana.TAIR10.42.gtf >/mnt/chaelab/donghui/TAIR10/new_index/TAIR10.ss
python extract_exons.py /mnt/chaelab/donghui/TAIR10/gtf/Arabidopsis_thaliana.TAIR10.42.gtf >/mnt/chaelab/donghui/TAIR10/new_index/TAIR10.exon
hisat2-build --ss /mnt/chaelab/donghui/TAIR10/new_index/TAIR10.ss --exon /mnt/chaelab/donghui/TAIR10/new_index/TAIR10.exon /mnt/chaelab/donghui/TAIR10/TAIR10_chr_all.fas /mnt/chaelab/donghui/TAIR10/new_index/index
trim raw_data by AfterQC,
for i in {12,13,15,16}
do
python2 /mnt/chaelab/donghui/download/AfterQC-master/after.py -1 hisat/SRR34982${i}.fastq
done
then run the pipeline:
gtf=/mnt/chaelab/donghui/TAIR10/gtf/Arabidopsis_thaliana.TAIR10.42.gtf
index=/mnt/chaelab/donghui/TAIR10/new_index/index
mkdir 1_hisat 2_samtools_bam 3_stringtie_gtf 4_stringtie_merge 5_stringtie_for_ballgown
for i in {12,13,15,16}
do
hisat2 -p 8 --dta -x $index -q good/SRR34982${i}.good.fq -S 1_hisat/SRR34982${i}.sam
done
for i in {12,13,15,16}
do
samtools sort -@ 8 -o 2_samtools_bam/SRR34982${i}.bam 1_hisat/SRR34982${i}.sam
done
for i in {12,13,15,16}
do
stringtie -p 8 -G $gtf -o 3_stringtie_gtf/SRR34982${i}.gtf -l SRR34982${i} 2_samtools_bam/SRR34982${i}.bam
done
ls 3_stringtie_gtf/*.gtf > 3_stringtie_gtf/assebly_GTF_list.txt
stringtie --merge -p 8 -o 4_stringtie_merge/merged_gtf.gtf -G $gtf 3_stringtie_gtf/assebly_GTF_list.txt
for i in {12,13,15,16}
do
mkdir 5_stringtie_for_ballgown/SRR34982${i}
stringtie -e -B -p 8 -G 4_stringtie_merge/merged_gtf.gtf -o 5_stringtie_for_ballgown/SRR34982${i}/SRR34982${i}.gtf 2_samtools_bam/SRR34982${i}.bam
done
Then the R script:
library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(ggplot2)
library(gplots)
library(devtools)
dir <- "/mnt/chaelab/donghui/root_shoot/"
setwd(dir)
pheno_data <- data.frame( ID = c('SRR3498212','SRR3498213', 'SRR3498215', 'SRR3498216'),
Type= c( 'root','root','shoot', 'shoot'))
bg <- ballgown(dataDir = "5_stringtie_for_ballgown", samplePattern = 'SRR34982', pData = pheno_data )
print(bg)
ballgown instance with 54127 transcripts and 4 samples
#remove low-abundance genes ( 0 and 1 )
bg_filt <- subset(bg,"rowVars(texpr(bg)) >1", genomesubset=TRUE)
print(bg_filt)
ballgown instance with 18483 transcripts and 4 samples
results_transcripts <- stattest(bg_filt, feature = "transcript", covariate = "Type", getFC = TRUE, meas = "FPKM")
results_transcripts <- data.frame(geneNames = ballgown::geneNames(bg_filt),
geneIDs = ballgown::geneIDs(bg_filt),
results_transcripts)
summary(results_transcripts)
head(results_transcripts, n = 500)
subset(results_transcripts,results_transcripts$qval<0.05)
results_transcripts = arrange(results_transcripts,pval)
write.csv(results_transcripts, "afterqc_transcript_results.csv", row.names=FALSE)
however, the result shows very high q-val,
geneNames geneIDs feature id
. :8699 MSTRG.6333: 11 transcript:18483 1000 : 1
A1 : 9 MSTRG.2749: 10 10004 : 1
ARP1 : 8 MSTRG.5542: 7 10008 : 1
ATGRP9 : 8 MSTRG.8965: 7 1001 : 1
RPL13B : 7 MSTRG.1251: 6 10011 : 1
UBQ10 : 7 MSTRG.1364: 6 10014 : 1
(Other):9745 (Other) :18436 (Other):18477
fc pval qval
Min. :0.000e+00 Min. :0.0000135 Min. :0.2488
1st Qu.:0.000e+00 1st Qu.:0.1320150 1st Qu.:0.5278
Median :1.000e+00 Median :0.3798743 Median :0.7596
Mean :8.941e+10 Mean :0.4354506 Mean :0.7457
3rd Qu.:8.000e+00 3rd Qu.:0.7277507 3rd Qu.:0.9703
Max. :1.645e+15 Max. :0.9999579 Max. :1.0000
geneNames geneIDs feature id fc pval qval
2 ARV1 AT1G01020 transcript 2 4.887496e-02 0.42724866 0.7960507
11 DCL1 MSTRG.10 transcript 11 1.996840e+00 0.86585464 0.9883641
12 . MSTRG.11 transcript 12 2.095266e+00 0.91548484 0.9937310
14 MIR838A MSTRG.12 transcript 14 1.098434e+00 0.98874281 0.9937310
15 PPA1 MSTRG.13 transcript 15 2.263532e+04 0.10262150 0.4894779
21 LHY AT1G01060 transcript 21 3.983061e-04 0.18029668 0.5788472
23 LHY AT1G01060 transcript 23 1.375484e+06 0.08195434 0.4894779
26 . MSTRG.2 transcript 26 1.592113e-01 0.83509407 0.9704967
27 . MSTRG.2 transcript 27 6.134825e+01 0.65382588 0.9334463
29 . MSTRG.5 transcript 29 7.045163e+04 0.08195434 0.4894779
30 . MSTRG.5 transcript 30 4.287336e-03 0.36838524 0.7517703
32 PDH-E1 ALPHA MSTRG.6 transcript 32 4.143954e-01 0.50230252 0.8452166
33 RPP1A MSTRG.7 transcript 33 1.099668e-01 0.89197850 0.9937310
34 RPP1A MSTRG.7 transcript 34 1.440411e+08 0.25745333 0.6535037
36 RPP1A MSTRG.7 transcript 36 1.340859e-08 0.51091104 0.8510193
39 KCS1 MSTRG.1 transcript 39 5.171983e+00 0.01450149 0.4894779
41 CIPK9 AT1G01140 transcript 41 1.938509e+05 0.08195434 0.4894779
47 GIF2 MSTRG.15 transcript 47 3.886182e-01 0.31962657 0.7093423
and no transcripts showd qval<0.05, quite opposite from the https://bioinformatics.uconn.edu/rnaseq-arabidopsis/
I meet the same problem and I also want to find out the mistakes.