Any R Package To Calculate “Mutation Rates/Mismatches”?
2
1
Entering edit mode
12.5 years ago
iftikhar ▴ 10

I have a large family data (Parents and children). All of them are in the form of alleles. i.e two alleles at each locus and subject. Our problem is to calculate mutation rate/counts (mismatches between child's alleles and parent's alleles) at each locus. Is there any R package which can easily do that?

Data is given below.

child<-c(4,8)
mother<-c(3,3)
father<-c(4,1)

Two alleles of each family member are given in numeric form. I just want to calculate mismatches. In this case only one mismatch.

Thanks in advance

bioconductor • 7.0k views
ADD COMMENT
0
Entering edit mode

Did you define your problem well enough? Why do you believe it is possible to calculate mutation rates from your data? Note that a variant is not necessarily a mutation.

ADD REPLY
0
Entering edit mode

If the allele is present in the child but neither of the parents if it is not a sequencing error it must be a new mutation, or?

ADD REPLY
0
Entering edit mode

That sounds reasonable, but does it yield a mutation rate, how do you define mutation rate? I understand that you wish to detect all variant alleles not present in the parents (neither homozygous nor heterozygous calls), then count these and average over all family samples, that way giving an average "mutation rate" per generation? That sounds feasible to script but I am not sure if there is an R-package that does exactly this out of the box. If you could give an example of your data-format it should be easy to figure something out.

ADD REPLY
0
Entering edit mode

I am agree with you. It will not yield a mutation rate. Only it will calculate mismatches. And i also have added example in my question.

ADD REPLY
2
Entering edit mode
12.5 years ago

[Out of topic given the new details on this question]

Here is some food for thought on your question. Let us assume you have the following family/data in a fasta alignment:

>dad_allele1
ATCAAT
>dad_allele2
ATCGAT
>mum_allele1
ATGGAT
>mum_allele2
ATGGAT
>child_allele1
AGCAAT
>child_allele2
ATCGAT

You could have a look at the differences between each sequence in order to determine which child allele sequence is not identical to a parental allele:

library(seqinr)
x=read.alignment("fasta",format="fasta")
dist.alignment(x)

You would then get the following distance matrix, where if a child sequence has a distance >0 to all parental alleles, then it will be considered as bearing a mutation. This is the case of child_allele1:

              dad_allele1 dad_allele2 mum_allele1 mum_allele2 child_allele1
dad_allele2     0.4082483                                                  
mum_allele1     0.5773503   0.4082483                                      
mum_allele2     0.5773503   0.4082483   0.0000000                          
child_allele1   0.4082483   0.5773503   0.7071068   0.7071068              
child_allele2   0.4082483   0.0000000   0.4082483   0.4082483     0.5773503

Now comes the tricky part. If you want to compute a mutation rate, you will have to infer the parental allele from which this sequence came from. Should we take the less distant one? This is a question you should reflect upon.

If yes, then dad_allele1 would be the parental allele and you would have 1-(1-0.4082483²) = 0.1666667 mutations/site. This, of course, assumes that the distance between the two versions of the sequence (before and after mutation) is short enough not to contain several mutations at a given position.

ADD COMMENT
0
Entering edit mode

I would leave the variant calling to an external tool, e.g. samtools, also you must definitely support heterozygous alleles. I hope for ease of implementation, that the variants are in standard VCF format.

ADD REPLY
0
Entering edit mode

Further, it often does no do any good to answer a question before it has been properly posed.

ADD REPLY
0
Entering edit mode

I was just proposing a solution given the information I could read in the question. Now that the question has been edited to include more information, my answer does not make sense anymore. Unfortunately, this is not my first experience of this in Biostars.

ADD REPLY
0
Entering edit mode

It's true, that can be frustrating, but please don't be discouraged and keep contributing. Sometimes it's better to nag the posters for more information first. Often, lack of proper details indicates some other problems with the question or project behind it exist, often because the definition of the scientific question might be poor. Then it might be better to say, I notice what you are asking but from my experience that is not the information you really need to know.

ADD REPLY
0
Entering edit mode

You have a very good point there. Concerning my answer, I will edit it as [Out of topic] but will leave it for the sake of archiving.

ADD REPLY
0
Entering edit mode

I am really sorry for that. I thought that it was just a simple question. Actually one of my friend need that package and he is not much familiar with R and i am not much familiar with Biology. I think that here is the problem.

ADD REPLY
2
Entering edit mode
12.5 years ago
Michael 55k

I think you should really try to describe your project to us better, especially try to answer these essential questions first. What kind of technology was used for genotyping, e.g. illumina sequencing or chips? Which parts of the genome were genotyped, e.g. the whole exome? How was the variant calling done, e.g. samtools? In which formt do your variant calls come, e.g. VCF format? The fact that you omit this essential information tells me, that your project is not well-defined at the moment and needs some care. In case this is a thesis type of project I urgently suggest you talk to your supervisor about this.

In this case I think the primary question should not be (and therefore in your best interest I won't answer it right now), whether there is some R package that does some analysis, but whether the data you have are suitable to carry out the analysis you wish to do. We are not in a position to judge this yet, so please try to give more details, so we can help you better.

Let me add that the actual calculus to derive the number of mutations could be trivial. In the optimal case you would have the following data and analsysis steps:

  • High coverage full exome sequencing data for each trio
  • Aligned the reads against the reference genome
  • Heterozygous variant calls in all samples using standard tools e.g samtools or GATK
  • The variant calls in standard format: a single VCF file and an annotation of the family relation

  • Then you can perform a variety of tests, e.g. test for deviation from Hardy-Weinberg Equilibrium which could give indication for presence of, among others, mutation.

To discover novel variants, it might a good idea to calibrate all sample against all know variants in dbSNP. How many novel variants can be discovered that way?

Finally, to calculate mutation rates from parent to child generation, it will be sufficient to look at the new variants (deviation form reference genome) called in children and not called in parents. These will give an approximation of mutation rates+sequencing error induced variants. While it is possible, that a mutation occurs in a variable site, the probability is relatively low, compared with the whole genome, so I think you could safely neglect them. On the other hand, if you only have genotyped already know variants and are looking for novel alles introduced at these sites, then your ability to infer a mutation rate is hampered possibly by the lack of evidence, because these events are so rare.

ADD COMMENT
0
Entering edit mode

Thanks for your detailed answer. I am not much familiar with Biology. One of my friend is doing work on that and he is not familiar with R. Without going in details "I just want to calculate mismatches between mother's alleles and child's alleles, father's alleles and child's alleles"

ADD REPLY
0
Entering edit mode

Maybe then your friend could edit the question and provide a bit more information and an example of the data format, and especially information about how the data were processed, I mean such genotyping data don't come out of the blue, there must be some sequencing involved. I am sure we could figure something out.

ADD REPLY

Login before adding your answer.

Traffic: 2153 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6