My question has two parts. If I should have broken my question up into two posts, please let me know and I will do so.
TL;DR I want to completely understand a specific example of a custom filter function, so that I am able to apply a modified version (or create my own) to my own RNA-seq dataset using the Sleuth package in R.
Part 1 Background:
When preparing a Sleuth object with the function 'sleuth_prep', the default transcript filter function applied is known as 'basic_filter'. The code for basic_filter is:
basic_filter <- function (row, min_reads = 5, min_prop = 0.47)
{
mean(row >= min_reads) >= min_prop
}
This filters out all transcripts that do not have at least 5 estimated counts in at least 47% of the samples. From my understanding, basic_filter is suitable for single factor, two level contrast experiments. However, applying the basic_filter to datasets with multiple factors and levels might result in lost information (i.e. potentially interesting transcripts being filtered out). I would like to reference a specific example of someone creating a custom filter function for a two-factor RNA-Seq experiment: the aim was to minimize the loss of interesting transcripts expressed in the interaction of those two factors, while also trying to reduce the inclusion of false positives.
http://achri.blogspot.com/2018/02/are-you-losing-important-genes-in-your.html
> meta
sample path condition cell_line
1 NSC_Ctl_1 NSC_Ctl_1.kallisto NSC Ctl
2 NSC_Ctl_2 NSC_Ctl_2.kallisto NSC Ctl
3 NSC_Ctl_3 NSC_Ctl_3.kallisto NSC Ctl
4 NSC_KO_1 NSC_KO_1.kallisto NSC KO
5 NSC_KO_2 NSC_KO_2.kallisto NSC KO
6 NSC_KO_3 NSC_KO_3.kallisto NSC KO
7 Odiff_Ctl_1 Odiff_Ctl_1.kallisto OD Ctl
8 Odiff_Ctl_2 Odiff_Ctl_2.kallisto OD Ctl
9 Odiff_Ctl_3 Odiff_Ctl_3.kallisto OD Ctl
10 Odiff_KO_1 Odiff_KO_1.kallisto OD KO
11 Odiff_KO_2 Odiff_KO_2.kallisto OD KO
12 Odiff_KO_3 Odiff_KO_3.kallisto OD KO
With the basic_filter applied:
> so <- sleuth_prep(meta, ~ condition+cell_line+condition:cell_line)
#reading in kallisto results
#............
#normalizing est_counts
#26036 targets passed the filter
#normalizing tpm
#merging in metadata
#normalizing bootstrap samples
#summarizing bootstraps
The author states that transcripts known to be differentially expressed were missing after applying basic_filter. The custom function the author creates:
> design_filter <- function(design, row, min_reads=5, min_prop = 0.47){
sum(apply(design, 2, function(x){
y <- as.factor(x);
return(max(tapply(row, y, function(f){sum(f >= min_reads)})/
tapply(row, y, length)) == 1
|| basic_filter(row, min_reads, min_prop)
)
})) > 0}
The result of applying the custom filter:
> so_redux <- sleuth_prep(meta, ~cell_line*condition,
filter_fun=function(x){design_filter(so$design,x)})
#reading in kallisto results
#............
#normalizing est_counts
#26370 targets passed the filter
#normalizing tpm
#merging in metadata
#normalizing bootstrap samples
#summarizing bootstraps
The author states this custom filter requires ~25% of samples to have "moderate expression" of the transcript, while adding the constraint that those 25% cover all replicates of some factor level. The custom filter passed 334 more transcripts than the basic_filter, which also included the author's known "true positive" transcript to appear in the results.
Part 1 Question:
Could someone, with the patience of a public school teacher, please walk me through understanding the author's custom function? I have a basic understanding of R commands, but I'm in the learning stages of being able to combine them into more complex custom functions. Learning how to follow this specific function will improve my overall understanding of R, and encourage me to create my own custom functions in the future.
Part 2 Background
I am processing RNA-seq data for my lab using Kallisto and Sleuth. The experiment includes 4 genotypes and 2 conditions (control vs treatment). In total there are 8 different combinations, and for each combination there are 3 biological replicates. The overall number of samples for this experiment is 8 x 3 = 24. The metadata (path information hidden):
> meta
sample genotype condition path
1 ad_neg2 ad control x
2 ad_neg3 ad control x
3 ad_neg5 ad control x
4 ad_plus1 ad treatment x
5 ad_plus2 ad treatment x
6 ad_plus3 ad treatment x
7 my_neg1 my control x
8 my_neg2 my control x
9 my_neg4 my control x
10 my_plus2 my treatment x
11 my_plus3 my treatment x
12 my_plus4 my treatment x
13 pc_neg1 pc control x
14 pc_neg2 pc control x
15 pc_neg3 pc control x
16 pc_plus1 pc treatment x
17 pc_plus3 pc treatment x
18 pc_plus4 pc treatment x
19 wt_neg2 wt control x
20 wt_neg3 wt control x
21 wt_neg5 wt control x
22 wt_plus1 wt treatment x
23 wt_plus2 wt treatment x
24 wt_plus3 wt treatment x
Part 2 Question:
I intend to create a Sleuth object with a model similar to the referenced author in Part 1 ~ genotype+condition+genotype:condition
and apply a similar custom filter function (i.e. require 12.5% of samples to have 5 mapped reads to a transcript, adding the constraint those 12.5% cover all replicates of a factor level). Does this filtering step seem reasonable? If so, how might I create a custom function for this?