4.5 years ago
ww22runner
Hello everyone,
I have bulk RNA sequencing results from WT and KO mice for a particular gene, 28 samples in total (14 WT and 14 KO). I am trying to generate a network file with ARACNe-AP as input for VIPER to study transcription factors that may be involved, but I keep getting an empty network file. Here are the commands I used:
> java -Xmx5G -jar dist/aracne.jar -e test/my_data/small_subset.txt -o outputFolder --tfs test/my_data/small_subset_tf.txt --pvalue 1E-8 --seed 1 \
--calculateThreshold
> for i in {1..100}
do
java -Xmx5G -jar dist/aracne.jar -e test/my_data/small_subset.txt -o outputFolder --tfs test/my_data/small_subset_tf.txt --pvalue 1E-8 --seed $i
done
> java -Xmx5G -jar dist/aracne.jar -o outputFolder --consolidate
My expression matrix looks like this, with NCBI gene IDs (mouse) in the gene column:
gene Sample1 Sample2 Sample3 Sample4 Sample5 ... Sample28
216795 67 56 84 23 139
My TF file looks like this and contains NCBI gene IDs for mouse TFs:
24208
24209
24252
24253
24309
24330
24333
Any advice would be greatly appreciated, thank you!
Presumably you have genes in the TF file that are also in the expression matrix and expressed?
What's in outputFolder after your calculateThreshold run? If you want to share files I can see if it runs here.
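One quick way to check that first point is to intersect the IDs in the TF file with the first column of the expression matrix. The following is only a minimal sketch: the helper names `read_gene_ids` and `tf_overlap` are made up for illustration, the toy data is invented (only the ID formats follow the post), and it assumes a tab-separated matrix with a header row and a TF file with one ID per line and no header.

```python
import io


def read_gene_ids(handle, has_header):
    """Collect the IDs found in the first tab-separated column of a file."""
    ids = set()
    for line_number, line in enumerate(handle):
        if has_header and line_number == 0:
            continue  # skip the "gene Sample1 Sample2 ..." header row
        first_field = line.rstrip("\n").split("\t")[0].strip()
        if first_field:
            ids.add(first_field)
    return ids


def tf_overlap(matrix_handle, tf_handle):
    """Return the TF IDs that also occur as rows of the expression matrix."""
    matrix_ids = read_gene_ids(matrix_handle, has_header=True)
    tf_ids = read_gene_ids(tf_handle, has_header=False)  # assumption: no header
    return tf_ids & matrix_ids


# Toy inputs (invented for illustration); in practice, pass open() file handles.
matrix = io.StringIO("gene\tSample1\tSample2\n216795\t67\t56\n24208\t10\t12\n")
tfs = io.StringIO("24208\n24209\n")

shared = tf_overlap(matrix, tfs)
print(f"{len(shared)} TF IDs found in the matrix: {sorted(shared)}")
```

If the intersection is empty, no TF has any expression values to work with, so an empty network would not be surprising.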
Hi Bruce, thank you for your reply. You are right: none of the genes in my TF file were present in the expression matrix, and this was the problem. Being new to this tool, I had tried reading papers such as (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5040167/pdf/nihms789775.pdf) but do not completely understand how it works. My gene expression matrix contained a set of differentially expressed genes from the experiment, and to generate the list of TFs in mice I looked for this information online (lists that others had used in their analyses, etc.). Is there a better way for me to generate a list of TFs if I am unsure which pathways may be involved in the KO mice? May I also ask how ARACNe-AP draws connections between genes and their regulatory counterparts, in very basic terms (in particular, why the overlap of genes between the two inputs is important)? Thank you!
I'd start by reading the original ARACNe paper.
The basic premise is that genes act in networks, but defining the network is not as simple as geneA correlates with genesB, C and D, and therefore regulates them.
ARACNe looks for direct pairwise interactions (e.g. between geneA+B, geneB+C, etc.). Pairwise dependence alone results in many false positives, because if geneA regulates geneB, and geneB regulates geneC, you will probably think geneA regulates geneC.
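ARACNe handles this kind of indirect edge with the data processing inequality (DPI): within a fully connected triplet of genes, the edge with the lowest mutual information is treated as indirect and removed. The snippet below is a rough sketch of that idea only, not ARACNe-AP's actual implementation. The function name `dpi_prune` and the mutual information values are invented for illustration, and it ignores details such as tolerance settings and TF-specific handling.

```python
from itertools import combinations


def dpi_prune(mi):
    """Drop the weakest edge of every fully connected gene triplet.

    mi maps frozenset({gene_x, gene_y}) to a mutual information value.
    Returns a new dict holding only the edges that survive the pruning.
    """
    genes = sorted({gene for edge in mi for gene in edge})
    removed = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(edge in mi for edge in triangle):
            continue  # DPI only applies when all three edges are present
        weakest = min(triangle, key=lambda edge: mi[edge])
        others = [edge for edge in triangle if edge != weakest]
        # Remove the edge only if it is strictly weaker than both others.
        if mi[weakest] < min(mi[edge] for edge in others):
            removed.add(weakest)
    return {edge: value for edge, value in mi.items() if edge not in removed}


# Invented values: geneA -> geneB -> geneC, with a weaker indirect A-C signal.
mi_values = {
    frozenset(("geneA", "geneB")): 0.9,
    frozenset(("geneB", "geneC")): 0.8,
    frozenset(("geneA", "geneC")): 0.3,
}
kept = dpi_prune(mi_values)
print(sorted(sorted(edge) for edge in kept))
```

With these toy values the geneA/geneC edge is the weakest in the triplet, so it is the one dropped.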
To define interactions, the expression values of each TF/target gene are compared against those of all genes in the matrix. The measure of statistical dependence (i.e. likelihood of regulation through direct interaction) is the mutual information (MI), which is 0 for complete independence (no regulation). This is why you need overlap between the TF list and the expression matrix. If the expression matrix has no values for the TF/targets, you cannot know what the MI is, and it is set to 0.
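To make the mutual information idea concrete, here is a small sketch that estimates it for two expression vectors. It uses simple equal-frequency (rank-based) binning, which is an assumption made for brevity and differs from the adaptive partitioning that ARACNe-AP itself uses. The function names and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y, n_bins=3):
    """Estimate the mutual information (in nats) between two samples."""
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    count_x, count_y = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        # p(i, j) * log( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (count / n) * math.log((count * n) / (count_x[i] * count_y[j]))
    return mi


# Toy vectors: the "target" follows the "TF" exactly in the first pair,
# and is spread evenly across the TF's bins in the second pair.
tf_expr = list(range(12))
dependent_target = [2 * v for v in tf_expr]
print(mutual_information(tf_expr, dependent_target))

tf_small = list(range(9))
independent_target = [0, 3, 6, 1, 4, 7, 2, 5, 8]
print(mutual_information(tf_small, independent_target))
```

The first pair gives a clearly positive estimate and the second gives zero, which matches the intuition above: no dependence means MI of 0, and a TF with no expression values cannot contribute any MI at all.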
In terms of lists of TFs/targets, you can use whatever set you like. I use Biomart and screen using the Gene Ontology (GO) term GO:0003700, which is 'transcription factor activity, sequence-specific DNA binding'; you can use others or take lists from databases. I also include the DE genes from my experiment: if there is a regulatory gene in there, then that is of particular interest (a gene doesn't necessarily have to be a TF to regulate other genes).
Hope that helps.
Hi Bruce, that was extremely helpful, thank you so much!
I think the tf file has to have the same header as the first column of the expression matrix, so:
Thank you for your reply Bruce but unfortunately it still gives me an empty output and I see something like this: