Hey there,
I am using the Kallisto / Sleuth pipeline for differential expression of a two condition (WT and KO) RNA-seq dataset (mouse) and I am having a problem with performing the gene-level differential expression in Sleuth. Part of the problem is that I am unsure how to integrate the gene_ids and transcript_ids into the Sleuth Object (I can get transcript-level differential expression to work, complete code for that not shown). Any help would be fantastic!
BACKGROUND
1) For pseudo-alignment using Kallisto (kallisto/0.45.1) I generated a transcriptome.idx file using Gencode gencode.vM23.transcripts.fa (Nucleotide sequences of all transcripts on the reference chromosomes). I used Gencode rather than ENSEMBL because the ENSEMBL cDNA FASTA file didn't contain noncoding RNAs.
2) Pseudo-alignment (Kallisto) for all of the FASTQ files works well (High TPMs were found for the target gene in WT abundance.tsv files and zero TPMs were found for the target gene in the KO abundance.tsv files)
PROBLEM:
I am having an issue associating transcripts to genes in order to do gene-level quantification in Sleuth (v0.30.0). I think there are a few factors at play:
3) Because I am using the Gencode transcriptome index, the output of my Kallisto run in the abudnance.tsv files contain a column that has gene_id, transcript_id, gene_symbol, biotype, etc all separated by a pipe (example shown below).
target_id ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|
Whereas the Pachter lab tutorial shows the Kallisto output with the first column (header=target_ID) having only ENSEMBL transcript IDs. (https://pachterlab.github.io/sleuth_walkthroughs/pval_agg/analysis.html)
4) When I build the Sleuth Object with the Ensembl Biomart IDs I get an error:
#load libraries
library(sleuth)`
library(cowplot)`
library(biomaRt)
#set base directory
base_dir <- "path_to_base_directory"
#get sample ids which is just the names of kallisto run folders
sample_ids <- dir(file.path(base_dir, "path_to_folders"))
#get directories where each kallisto run is located
kal_dir <- file.path(base_dir, "path_to_folders", sample_ids)
#metadata
metadata <- read.table(file.path(base_dir,
"meta_data/meta_data_hsc.txt"), header = TRUE, stringsAsFactors = FALSE)
#add file paths to the metadata table
metadata <- mutate(metadata, path = kal_dirs)
#check merge of metadata
print(metadata)
##associating transcripts to genes
#to perform gene-level analysis sleuth must parse a gene annotation. use using biomaRt and Ensembl
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "mmusculus_gene_ensembl",
host = "dec2015.archive.ensembl.org")
ttg <- biomaRt::getBM(
attributes = c("ensembl_transcript_id", "transcript_version",
"ensembl_gene_id", "external_gene_name", "description",
"transcript_biotype"),
mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)
##building a sleuth object for gene-level analysis
so <- sleuth_prep(metadata, target_mapping = ttg,
aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)
Notes from Pachter lab: The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters. In the example that follows, metadata is the experimental design and target_mapping describes the transcript groupings into genes previously constructed. Furthermore, we provide an aggregation_column, the column name of in ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that came from the aggregation of transcript p-values.
5) Error I get when building the Sleuth Object:
Error in check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column)) : couldn't solve nonzero intersection
3. stop("couldn't solve nonzero intersection")
2. check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column))
1. sleuth_prep(metadata, target_mapping = ttg, aggregation_column = "ens_gene", extra_bootstrap_summary = TRUE)
What should be the code if I wish to replace entire target_id ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|
with 4933401J01Rik-201. it is after 4th "|". It is transcript_id.
Also, if I wish to replace the line next to > with 4933401J01Rik-201. it should look like
"> 4933401J01Rik-201" instead of
"> ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1| OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|"
Thanks