Using Gencode transcript annotation in Kallisto / Sleuth pipeline
2
0
Entering edit mode
4.9 years ago
Jordan • 0

Hey there,

I am using the Kallisto / Sleuth pipeline for differential expression of a two condition (WT and KO) RNA-seq dataset (mouse) and I am having a problem with performing the gene-level differential expression in Sleuth. Part of the problem is that I am unsure how to integrate the gene_ids and transcript_ids into the Sleuth Object (I can get transcript-level differential expression to work, complete code for that not shown). Any help would be fantastic!

BACKGROUND

1) For pseudo-alignment using Kallisto (kallisto/0.45.1) I generated a transcriptome.idx file using Gencode gencode.vM23.transcripts.fa (Nucleotide sequences of all transcripts on the reference chromosomes). I used Gencode rather than ENSEMBL because the ENSEMBL cDNA FASTA file didn't contain noncoding RNAs.

2) Pseudo-alignment (Kallisto) for all of the FASTQ files works well (High TPMs were found for the target gene in WT abundance.tsv files and zero TPMs were found for the target gene in the KO abundance.tsv files)

PROBLEM:

I am having an issue associating transcripts to genes in order to do gene-level quantification in Sleuth (v0.30.0). I think there are a few factors at play:

3) Because I am using the Gencode transcriptome index, the output of my Kallisto run in the abudnance.tsv files contain a column that has gene_id, transcript_id, gene_symbol, biotype, etc all separated by a pipe (example shown below).

target_id ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|

Whereas the Pachter lab tutorial shows the Kallisto output with the first column (header=target_ID) having only ENSEMBL transcript IDs. (https://pachterlab.github.io/sleuth_walkthroughs/pval_agg/analysis.html)

4) When I build the Sleuth Object with the Ensembl Biomart IDs I get an error:

#load libraries
library(sleuth)`
library(cowplot)`
library(biomaRt)

#set base directory
base_dir <- "path_to_base_directory"

#get sample ids which is just the names of kallisto run folders
sample_ids <- dir(file.path(base_dir, "path_to_folders"))

#get directories where each kallisto run is located 
kal_dir <- file.path(base_dir, "path_to_folders", sample_ids)

#metadata
metadata <- read.table(file.path(base_dir,
     "meta_data/meta_data_hsc.txt"), header = TRUE, stringsAsFactors = FALSE)

#add file paths to the metadata table
metadata <- mutate(metadata, path = kal_dirs)

#check merge of metadata
print(metadata)

##associating transcripts to genes
#to perform gene-level analysis sleuth must parse a gene annotation. use using biomaRt and Ensembl
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
     dataset = "mmusculus_gene_ensembl",
     host = "dec2015.archive.ensembl.org")
ttg <- biomaRt::getBM(
     attributes = c("ensembl_transcript_id", "transcript_version",
     "ensembl_gene_id", "external_gene_name", "description",
     "transcript_biotype"),
      mart = mart)
ttg <- dplyr::rename(ttg, target_id = ensembl_transcript_id,
     ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
ttg <- dplyr::select(ttg, c('target_id', 'ens_gene', 'ext_gene'))
head(ttg)

##building a sleuth object for gene-level analysis
so <- sleuth_prep(metadata, target_mapping = ttg,
     aggregation_column = 'ens_gene', extra_bootstrap_summary = TRUE)

Notes from Pachter lab: The sleuth object contains specification of the experimental design, a map describing grouping of transcripts into genes (or other groups), and a number of user specific parameters. In the example that follows, metadata is the experimental design and target_mapping describes the transcript groupings into genes previously constructed. Furthermore, we provide an aggregation_column, the column name of in ‘target_mapping’ table that is used to aggregate the transcripts. When both ‘target_mapping’ and ‘aggregation_column’ are provided, sleuth will automatically run in gene mode, returning gene differential expression results that came from the aggregation of transcript p-values.

5) Error I get when building the Sleuth Object:

Error in check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column)) : couldn't solve nonzero intersection
3. stop("couldn't solve nonzero intersection")
2. check_target_mapping(tmp_names, target_mapping, !is.null(aggregation_column))
1. sleuth_prep(metadata, target_mapping = ttg, aggregation_column = "ens_gene", extra_bootstrap_summary = TRUE)
RNA-Seq Sleuth Kallisto Gencode • 4.0k views
ADD COMMENT
0
Entering edit mode
4.9 years ago
ATpoint 86k

I see two options:

1) -- the more general one: Reformat the initial fasta file to drop every name after the first pipe so that you only have left:

>ENSMUST....
(The Sequence)
>ENSMUST
(The Sequence)

or

2) -- the tedious one, manually "parse away" everything than the transcript name in your abundance files. Maybe this can be done directly in R, but I am not a kallisto user so I do not know how exactly these files look. I personally use salmon and they even have a --gencode flag for index generation to do what I described in 1) to avoid exactly this issue (not for Sleuth though, but in general, avoiding super-long names and reducing to only the txname).

ADD COMMENT
0
Entering edit mode
4.9 years ago

Was googling around for the same question, found an answer on the Kallisto/Sleuth google users group. Their solution was to remove the extra information from the hd5 files

https://groups.google.com/forum/#!searchin/kallisto-and-applications/gencode%7Csort:date/kallisto-and-applications/KQ8782UD35E/hbqqMOgGBwAJ


library(rhdf5)

files <- list.files(<insert working="" directory="">, pattern=".h5", recursive=TRUE, full.names=TRUE)

for (currentFile in files) { oldids <- h5read(currentFile, "/aux/ids") newids <- gsub("\|.*", "", oldids) h5write(newids, currentFile, "/aux/ids")

}

ADD COMMENT
0
Entering edit mode

What should be the code if I wish to replace entire target_id ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|

with 4933401J01Rik-201. it is after 4th "|". It is transcript_id.

Also, if I wish to replace the line next to > with 4933401J01Rik-201. it should look like

"> 4933401J01Rik-201" instead of

"> ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1| OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|"

Thanks

ADD REPLY

Login before adding your answer.

Traffic: 2387 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6