Hey folks.
I am trying to find differential transcript expression with the Hisat2-stringtie-ballgown pipeline. Originally, I established everything comparing two groups with six samples each. It looked promising and I got way over 1000 significant transcripts according to the resulting q-value.
Next step was to include more groups (also each with six replicates) into my workflow. The hisat2 step was done on each sample separately.
Script at the end of this post shows the steps after alignment and indexing but before ballgown.
Like this I realized that also the merged files will be different since everything is now built on all seven group and 42 samples instead of my initial trial with two groups, thus only 12 samples.
So I expected to some variations in my significant transcripts list. however, what I eventually saw were only a handful of significant transcript ids.
Here is the command line:
# get transcript assemblies in gtf
for file in bam/*.bam; do
base=$(basename "$file" .bam)
stringtie -p 16 --rf -G /home/chuddy/bioinformatics/database_references/mouse_39/annotation/gencode_vm34/gencode.vM34.annotation.gtf \
-o gtf/${base}.gtf -l ${base} "$file"
done
# merge transcript gtf files
readlink -f gtf/* > gtf/mergelist.txt
stringtie --merge -p 16 \
-G /home/chuddy/bioinformatics/database_references/mouse_39/annotation/gencode_vm34/gencode.vM34.annotation.gtf \
-o gtf/stringtie_merged.gtf gtf/mergelist.txt
gffcompare -r /home/chuddy/bioinformatics/database_references/mouse_39/annotation/gencode_vm34/gencode.vM34.annotation.gtf \
-G -o summaries/gffcompare/merged gtf/stringtie_merged.gtf
# calc abundance
for file in bam/*.bam; do
base=$(basename "$file" .bam)
stringtie -e -B -p 16 -G gtf/stringtie_merged.gtf -o ballgown/${base}/${base}.gtf "$file"
done
And from here the R-script for ballgown:
pheno_data = read.csv("meta_all.csv")
unique_groups <- unique(pheno_data$group)
bg_list <- list()
for (group in unique_groups) {
if (group != "normoxia") {
filtered_pheno_data <- pheno_data[pheno_data$group %in% c("normoxia", group), ]
sample_pattern <- paste("normoxia|", group, sep = "")
# create ballgown object
bg_data <- ballgown(dataDir = "ballgown",
samplePattern = sample_pattern,
pData = filtered_pheno_data)
bg_list[[sample_pattern]] <- bg_data
}
}
indexes(bg_data_filt)$pData$group <- fct_relevel(indexes(bg_data_filt)$pData$group, "normoxia")
# we only want the most variable transcripts between conditions
bg_data_filt <- subset(bg_list[[1]], "rowVars(texpr(bg_data)) > 1", genomesubset=TRUE) # "1" was used in publication
# stats
# comparing transcripts
results_transcripts <- stattest(bg_data_filt,
feature="transcript",
covariate="group",
getFC=TRUE,
meas="FPKM")
What is left to ask:
- is there any obvious mistake on my side that explains the huge difference in significant transcripts?
- why can't I make one big ballgown object using the metadata but instead needed to filter for the pairwise analysis already at the step of
ballgown()
? Did I miss a neat trick there?
cheers Tom
sry for the formatting. couldn't change it.
I've reformatted your post.
thank you a lot! : )