Hi,
I have a set of whole genome sequencing data, I want to find driver mutations so I am trying to use dndscv tool. But, I am permanently getting this error
> dndsout = dndscv(mutation)
[1] Loading the environment...
[2] Annotating the mutations...
Error in dndscv(r) :
Zero coding substitutions found in this dataset. Unable to run dndscv.
In addition: Warning messages:
1: In dndscv(r) :
Mutations observed in contiguous sites within a sample. Please annotate or remove dinucleotide or complex substitutions for best results.
2: In dndscv(r) :
Same mutations observed in different sampleIDs. Please verify that these are independent events and remove duplicates otherwise.
3: In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)
>
This is the header and tail of my data
> head(mutation)
sampleID chr pos ref mut
1 Sample_1 chr1 7914928 A ATAC
2 Sample_1 chr1 15080062 A AG
3 Sample_1 chr1 16714609 C CAT
4 Sample_1 chr1 19078857 TA T
5 Sample_1 chr1 22621124 TA T
6 Sample_1 chr1 34556929 TA T
> tail(mutation)
sampleID chr pos ref mut
85912 Sample_4 hs37d5 33864042 A G
85913 Sample_4 hs37d5 34253273 C A
85914 Sample_4 hs37d5 35186366 A C
85915 Sample_4 hs37d5 35222851 A G
85916 Sample_4 hs37d5 35232459 A C
85917 Sample_4 hs37d5 35295511 A G
>
Anyone knows how I can solve this error?
Thank you for your time assigning to my problem, without asking for your help for sure for days I had to wandering in google but I know I would not find the solution myself
My data not only contained chr prefix but also something like hs37d5 and chrM and so on, so by your kindly comment I removed all then code worked. However, I am not sure if it is right to ignoring these contigs.
I think it is fine to dismiss those contigs to study selection and to search for cancer genes. However, if you are very interested in including these contigs in the analysis, you can build your own RefCDS object using the buildref function in the package (see a tutorial below). But I would not worry about this unless you have a particular interest in these contigs. http://htmlpreview.github.io/?http://github.com/im3sanger/dndscv/blob/master/vignettes/buildref.html
The mitochondrial genome (chrM) is a special one and cannot be analysed with the rest of the genome. The reasons are that the genetic code is different (which matters when running dN/dS) and that the mitochondrial genome has very different mutation rates and spectrum that need to be modelled separately. For cancer gene discovery it is reasonable to exclude the mitochondrial genome, however, if you are interested in studying selection in the mitochondria you can again use the buildref function to build an object for it. The mitochondria has an additional challenge in that there is a very strong replication strand asymmetry. So, if you were interested in mitochondria, I would recommend running dNdScv separately for the MT-ND6 gene (see paper below, but note that the dN/dS model that we used in this paper was developed before dNdScv). https://elifesciences.org/articles/02935
But I think that almost all users can rely on the default object from dNdScv and not worry about the additional contigs or about the mitochondrial genome.
Sorry,
For getting better results, do you suggest me to use original vcf files as input for your software or I should first filtrate vcf file for minor allele frequency and variant with deleterious impact of the protein?
I would need to know more about your project to give you a good answer. It all depends on how clean your table of somatic mutations is. Many false positives and artefacts tend to have low allele frequencies, and filtering them out may improve the results. However, if your dataset is clean, you don't need to do that. You should also never filter your mutations based on their impact to the protein before running dNdScv, because the method expects an unbiased dataset (e.g. dN/dS needs synonymous mutations to model the mutation rates).
Thanks a lot, very very appreciated as I was struggling especially about the filtration based on influencing proteins.
My project is
responding or non-responding to chemotherapy in oesophagus adenocarcinoma
; So, we have whole genome sequencing of responder patients (tumour vs normal tissue) and matched non-responder patients to chemotherapy. Being responders or non-responders has been assessed by mandard scoring (TRG). However, as different patients carrying different and heterogeneous tumours, I am not sure if finding the driver genes could help in answering what is the difference in the genomes of responders versus non-responders.The authors of
The landscape of selection in 551 esophageal adenocarcinomas defines genomic biomarkers for the clinic highly recommended me to use dNdScv as their best performing tool, so I hope I could find a signature in responders and non-responders to chemotherapy by finding significant driver genes. Please kindly give me a clue if I am in wrong path to answer my question.
You can certainly try dNdScv, running it separately on each set of patients and on both sets together. This will give you lists of significant genes and some measure of selection. However, dNdScv is not designed to look for differences between two cohorts, so you may want to consider alternative approaches (e.g. statistically compare the fraction of patients with a driver mutation in a given gene in both sets of patients). I would also advice you to contact the authors of that paper for advice, since they will be more familiar with the type of cancer that you are analysing.
Sorry,
I likely analyzed my data with dNdScv successfully; However, I am not sure how I could achieve three ouputs
1- For getting the percentage of driver mutations in each gene and in different mutation classes I am not sure which table gives me these. For example
If I am not wrong for TP53 the class of mutation is missence?
2- Output of the mean number of exonic driver mutations per case
3- Passenger mutation rates
I am sorry to be this much disturbing
Hello, The outputs that you are asking for are not standard outputs from dndscv. They can be obtained with some additional calculations based on the outputs of dNdScv and following the methods of my paper in Cell. However, the risk of these measures being misused or misinterpreted by users without a good understanding of dNdScv was so high, that I decided not to make them part of the standard output.
Answering your questions specifically:
Confidence intervals for dN/dS values per gene are not available and they would ideally be needed for the question that you ask. I will consider adding them to future versions of dNdScv. In the meantime, note that the columns wmis_cv and wnon_cv are the maximum likelihood estimates (point estimates) of the dN/dS values per gene. For example, for TP53 dN/dS ratios for missense and truncating (nonsense + essential splice site) mutations are 247.5 and 622, respectively. dN/dS ratios measure the excess of mutations observed over those expected by chance under neutrality. So, you can calculate the excess of missense mutations in TP53 as: (247.5-1)/247.5*14 = 13.94. That is, the rate of passenger missense mutations expected in TP53 under neutrality was very low (<<1) and most likely all 14 missense mutations seeing in TP53 are drivers. As I said, ideally one would use confidence intervals for the dN/dS values to add uncertainty to these calculations. Also, note that in the absence of confidence intervals, I would only calculate these values for genes with q-value<0.05 (in this case only TP53).
The global estimate for the number of driver mutations per exome can be obtained using dndsout$globaldnds with the code below. Where nmuts is the number of mutations of each class (missense and nonsense+essential_splice) and nsamples is the number of samples (or exomes) studied. Note, however, that a number of assumptions are used in these calculations, so I would not use any of this unless you are confident that you understand the methodology and the principles of dNdScv and the suitability of your data.
Not sure what you mean by "passenger mutation rates". But the tutorial contains information that may be useful in this direction.
If you are trying to recapitulate the analyses of a given paper, like the one you mentioned before on esophageal adenocarcinomas, I would advice you to directly contact the authors for advice (and code) to reproduce their analyses.
I hope it helps. -Inigo
Thank you very much to be this much patient and helpful. Actually I have my own independent data with different question than the paper mentioned initially. The authors just suggested your software as their best performing tool for finding genes under positive selection. However I will need a lot of time to grab what you kindly described here. Sorry but now,
By
q-value
only a few genes will remain for me (one or two gene including TP53), could I instead usep-value
for reporting a list of possible driver genes ?Thank you again
Not really, when calculating p-values for 20,000 genes, you expect many (~1,000) to have p-values<0.05 by chance. q-values are corrected for multiple testing to avoid this and so q-values should always be used. You can use a less conservative q-value cutoff (e.g. 0.10) but I would be careful as you can start to include false positives in your list of significant genes.
There is a better way of increasing power to detect significance in known cancer genes, and that is running dNdScv on a list of known cancer genes. Or else, perform FDR correction (p.adjust function) on a list of cancer genes using the output from dNdScv (which is a better way of doing it as you use all genes to run dNdScv). In any case, this will only increase the power on already known cancer genes and needs to be reported as Restricted Hypothesis Testing (see example)
Sorry @ Inigo Martincorena
By my too little information, a gene is being considered as driver if be mutated in a wide range of samples; For instance in mentioned paper they are ignoring significant driver genes carrying less than 7 mutations. In my data even for a gene with
significant overall q-value
I have2 or maximum three samples
having this mutation; So, in your opinion please, could I still conclude that any of the genes in my results is a driver with this information?Thank you so much anyway
Can you show the output table (sel_cv) for this gene (feel free to remove the gene name if you want to keep it anonymous)? It is hard to tell without seeing the data, since there can be artefacts affecting the reliability of some results.
But, yes, dNdScv can sometimes successfully identify recurrently mutated genes with very few mutations. For example, considering the number of samples, the size of a gene, its sequence and the trinucleotide mutation rates in a dataset, the model may estimate that only 0.001 mutations are expected in a given gene by chance. If you saw 3 mutations in that gene, its significance could be very high.
Thank you very much
For instance, here CHST2 is significant
(q-trunc_cv)
but in compared to TP53 with a lot of mutations, this gene carries very few mutationsTo avoid incurring into multiple testing problems, I would recommend using "qallsubs_cv", which combines the evidence from missense and truncating substitutions. You can see that TP53 has a very low qallsubs_cv (<0.001), while CHST2 is borderline or non-significant (0.0905). The q-value that you are referring to (qtrunc_cv) is the combined p-value for truncating substitutions (nonsense and essential splice site mutations). As you can see in the n_non and n_spl columns, there are 3 truncating substitutions in TP53 and 2 in CHST2, which is consistent with their q-values for truncating substitutions being 0.001 and 0.02 respectively. Also, note that the size of the gene, its sequence and the mutational signatures are taken into consideration to calculate p-values, and not just the absolute number of mutations.
Sorry to be too much silly in questioning;
Could I run your software for finding non-coding regions of human genome?
For instance likely
https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMjE3NzgvZWxpZmUtMjE3Nzgtc3VwcDQtdjMueGxzeA==/elife-21778-supp4-v3.xlsx?_hash=KQi5jfO3kT2c4Qw44j4Rg6YAyCBQilYuWHVYXcRDuuo%3D
Gives non-coding parts of genome
Do you mean running an analysis to detect significantly mutated non-coding regions? Unfortunately, the dNdScv package only works on coding mutations at the moment, although I may published a new function (called NBR) to do this in the near future. There are some methods out there to do what you say, but they tend to give long lists of false positives (see the paper below): https://www.biorxiv.org/content/10.1101/237313v1
Sorry for coming back again
By strelka I have called SNV on my samples
Now I have one .vcf file contains SNV for each cancer-organoid pair of each patient
I spitted this .vcf file to two .vcf files as ORGANOID and TUMOR
I want two .vcf files for my paired samples because I need to check the consistency of mutations in two paired samples For instance for TP53 I want to know if this gene was mutated in both samples or just one of them
by
dndsout$annotmuts
I have samples and genes mutated in samples, I am wondering is this the same I need to find the consistency between significantly mutated genes for a pair of samples?Thank you so much for your time