Hi All, I'm working through the kissplice TWAS pipeline to identify condition-specific SNPs among 3 phenotypes in a wild fish species. Its quite a large dataset, with 4 or 5 replicates per phenotype and 10M paired reads per replicate. Using default parameters and the --experimental algorithm (commands below), kissplice finds over 1 million Type0a SNPs. Increasing the relative filtering cutoff to 10% (-C 0.1) still finds over 700,000 SNPs. This seems huge given that the phenotypes represent alternative outcomes of phenotypic plasticity, so do not result from fixed genetic differences. Also, when I use the results files in kissDE (v1.5.0), the kissplice2counts function runs for hours (so far overnight and still going). I can run the example file with this command in a second. While my result files contain a huge number of SNPs, this function should simply be creating a data frame for further analysis? What is the expected runtime for this?
I am interested in SNPs that are fixed or close to fixed among my conditions (e.g. that there may be underlying genetic predisposition to transition among the 3 phenotypes). I had planned to filter the data frame (e.g. remove SNPS with <10 reads mapped in at least one sample) in R, but am concerned the files are too large for the kissplice2counts function.
I'm surprised at the number of SNPs. We have a rough draft genome and estimate genome-wide heterozygosity at ~1.15%, so neither low or excessive. I'm running kissplice again with a higher kmer value of 50, to hopefully reduce the number of SNPs further.
Any advice on an expected kissDE runtime or ways to refine the number of SNPs returned given my dataset, would be very much appreciated. Thanks in advance.
kissDE command: snpCounts <- kissplice2counts("results_0.1_type0a.fa", pairedEnd=TRUE) Running in R version 3.4.4., on a Macbook Pro 3.1 GHz Processor, 16 GB memory.
Kissplice commands: kissplice -t 6 -s 1 -k 41 --experimental -r filenames.... kissplice -t 6 -s 1 -k 41 --experimental -C 0.1 -r filenames....
Thanks for the quick response, and the helpful suggestion. Some help with the script would be fantastic - as I have 3 conditions, I would be concerned that removing all SNPs with <10 counts in at least one individual might remove instances of differential expression between two conditions? Best regards, Erica.
Dear Erica,
Did the Bioconductor version of kissDE work for you?
Here you can find a very simple script to filter out some lowly-expressed bubbles: https://www.dropbox.com/s/6zfx946kqo7yxfb/remove_lowly_covered_bubbles.py?dl=1
You call it as:
where:
This script is far from ideal (i.e. it does not take into account biological replicates and conditions), but it can filter out many of uninteresting bubbles, with fairly low counts.
For example, to filter out all bubbles where the upper or the lower path contains more than 4 read files with less than 10 reads mapping to it, we would use:
If it does not help you, maybe we should take into conditions and biological replicates.