I have been trying to run Inigo Martincorena's dNdScv on some of my exome sequencing data, and I am getting an error saying that zero coding substitutions were found in the dataset. From what I have read, this problem is often caused by input data formatted differently from the example dataset, but as far as I can tell my data is formatted exactly the same way and I still get the error. My data is also aligned to hg19, so I assume there should be no reference genome mismatch either.
Below is a sample of my data:
  sampleID chr    pos ref mut
1 Sample_1   1 808631   G   A
2 Sample_1   1 808922   G   A
3 Sample_1   1 808928   C   T
4 Sample_1   1 809876   A   G
5 Sample_1   1 865219   G   A
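For anyone comparing their own input against this layout, here is a minimal sketch of a format sanity check in base R. The toy table reuses the five rows above. The specific checks are my own assumptions about what commonly goes wrong (column order, a `chr` prefix on chromosome names, non-ACGT alleles), not something taken from the dNdScv code itself.

```r
# Toy table with the same five rows as the sample above.
mutations <- data.frame(
  sampleID = rep("Sample_1", 5),
  chr      = rep("1", 5),
  pos      = c(808631, 808922, 808928, 809876, 865219),
  ref      = c("G", "G", "C", "A", "G"),
  mut      = c("A", "A", "T", "G", "A"),
  stringsAsFactors = FALSE
)

# Column layout used in this thread (sample, chromosome, position, ref, alt).
expected_cols <- c("sampleID", "chr", "pos", "ref", "mut")
format_ok <- identical(colnames(mutations)[1:5], expected_cols)

# Assumption: the reference uses chromosome names without a "chr" prefix,
# so a prefixed column would be worth investigating.
has_chr_prefix <- any(grepl("^chr", mutations$chr))

# Flag rows where both alleles are single A/C/G/T bases (plain substitutions).
bases <- c("A", "C", "G", "T")
is_snv <- mutations$ref %in% bases & mutations$mut %in% bases

print(format_ok)
print(has_chr_prefix)
print(table(is_snv))
```

If all three checks look fine, as they do for the rows above, the formatting is probably not the cause and it is worth looking elsewhere.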
Thanks for the response, Inigo. I will confirm that there are indeed coding mutations in the data. That said, the dataset is about 200 exome-sequenced samples with a bit over 10 million identified variants in total, so I assume there should be plenty of coding substitutions.
I realize this is not much information to troubleshoot...
So, I was not paying close attention to the console, and it turns out the samples contained too many variants. In the notebook file the error only said that no coding substitutions were found, but the console output also mentions that there are too many variants in each sample.
Setting max_muts_per_gene_per_sample = Inf and max_coding_muts_per_sample = Inf in the dndscv call solved the problem.
Thanks Inigo.
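For anyone landing here with the same error, here is a rough sketch of the fix above in R. The two argument names are the ones from my post. Everything else is an assumption for illustration: the toy table, the helper name `run_dndscv_unfiltered`, and the cap of 3000, which is only my recollection of the default for max_coding_muts_per_sample. As far as I understand, the per-sample filter counts coding mutations after annotation, so the raw counts below are only an upper bound.

```r
# Toy table in the five-column layout from the first post.
mutations <- data.frame(
  sampleID = rep("Sample_1", 5),
  chr      = rep("1", 5),
  pos      = c(808631, 808922, 808928, 809876, 865219),
  ref      = c("G", "G", "C", "A", "G"),
  mut      = c("A", "A", "T", "G", "A"),
  stringsAsFactors = FALSE
)

# Raw variants per sample (an upper bound on coding mutations per sample).
muts_per_sample <- table(mutations$sampleID)

# Assumed default cap; check the package documentation for the real value.
default_cap <- 3000
over_cap <- names(muts_per_sample)[as.vector(muts_per_sample > default_cap)]
print(muts_per_sample)
print(over_cap)

# Hypothetical wrapper: run dndscv with both per-sample filters disabled.
# Not called here, since the toy table is far too small for a real fit
# and the dndscv package must be installed.
run_dndscv_unfiltered <- function(muts) {
  dndscv::dndscv(
    muts,
    max_muts_per_gene_per_sample = Inf,
    max_coding_muts_per_sample   = Inf
  )
}
```

On a real dataset the call would be `dndsout <- run_dndscv_unfiltered(mutations)`. Passing a finite but higher threshold instead of Inf would keep some protection against hypermutated samples, if that matters for the analysis.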