Question

ballgown shows very different results depending on number of groups prepared with hisat2-stringtie

0

Entering edit mode

7 weeks ago

thomas.heigl.ibk • 0

Hey folks.

I am trying to find differential transcript expression with the Hisat2-stringtie-ballgown pipeline. Originally, I established everything comparing two groups with six samples each. It looked promising and I got way over 1000 significant transcripts according to the resulting q-value.

Next step was to include more groups (also each with six replicates) into my workflow. The hisat2 step was done on each sample separately.

Script at the end of this post shows the steps after alignment and indexing but before ballgown.

Like this I realized that also the merged files will be different since everything is now built on all seven group and 42 samples instead of my initial trial with two groups, thus only 12 samples.

So I expected to some variations in my significant transcripts list. however, what I eventually saw were only a handful of significant transcript ids.

Here is the command line:

# get transcript assemblies in gtf
for file in bam/*.bam; do
    base=$(basename "$file" .bam)
    stringtie -p 16 --rf -G /home/chuddy/bioinformatics/database_references/mouse_39/annotation/gencode_vm34/gencode.vM34.annotation.gtf \
        -o gtf/${base}.gtf -l ${base} "$file"
done


# merge transcript gtf files
readlink -f gtf/* > gtf/mergelist.txt
stringtie --merge -p 16 \
  -G /home/chuddy/bioinformatics/database_references/mouse_39/annotation/gencode_vm34/gencode.vM34.annotation.gtf \
  -o gtf/stringtie_merged.gtf gtf/mergelist.txt     

gffcompare -r /home/chuddy/bioinformatics/database_references/mouse_39/annotation/gencode_vm34/gencode.vM34.annotation.gtf \
  -G -o summaries/gffcompare/merged gtf/stringtie_merged.gtf

# calc abundance 
for file in bam/*.bam; do
    base=$(basename "$file" .bam)
    stringtie -e -B -p 16 -G gtf/stringtie_merged.gtf -o ballgown/${base}/${base}.gtf "$file"
done

And from here the R-script for ballgown:

pheno_data = read.csv("meta_all.csv")
unique_groups <- unique(pheno_data$group)
bg_list <- list()

for (group in unique_groups) {
    if (group != "normoxia") {
        filtered_pheno_data <- pheno_data[pheno_data$group %in% c("normoxia", group), ]

        sample_pattern <- paste("normoxia|", group, sep = "")

        # create ballgown object
        bg_data <- ballgown(dataDir = "ballgown",
                                          samplePattern = sample_pattern,
                                          pData = filtered_pheno_data)

        bg_list[[sample_pattern]] <- bg_data
    }
}

indexes(bg_data_filt)$pData$group <- fct_relevel(indexes(bg_data_filt)$pData$group, "normoxia")

# we only want the most variable transcripts between conditions
bg_data_filt <- subset(bg_list[[1]], "rowVars(texpr(bg_data)) > 1", genomesubset=TRUE) # "1" was used in publication

# stats
# comparing transcripts
results_transcripts <- stattest(bg_data_filt, 
                                                   feature="transcript", 
                                                   covariate="group",
                                                   getFC=TRUE, 
                                                   meas="FPKM")

What is left to ask:

is there any obvious mistake on my side that explains the huge difference in significant transcripts?
why can't I make one big ballgown object using the metadata but instead needed to filter for the pairwise analysis already at the step of ballgown()? Did I miss a neat trick there?

cheers Tom

ballgown stringtie • 337 views

ADD COMMENT • link 7 weeks ago by thomas.heigl.ibk • 0

0

Entering edit mode

sry for the formatting. couldn't change it.

ADD REPLY • link 7 weeks ago by thomas.heigl.ibk • 0

1

Entering edit mode

I've reformatted your post.

ADD REPLY • link 7 weeks ago by Ram 44k

0

Entering edit mode

thank you a lot! : )

ADD REPLY • link 7 weeks ago by thomas.heigl.ibk • 0