how to make bootstrapped tree in PVCLUST package with SNP genotyping data?
1
0
Entering edit mode
7.2 years ago

Hi to all Good Afternoon

I would like to make genetic tree with SNP genotyping data contains data of 20 genotypes with 394 markers and the same would like to validate with bootstrapping technique. This is my data

https://www.dropbox.com/s/7mimhjh94550uaq/SNP%20genotyping%20data.csv?dl=0

Now I am using different R packages like adegenet, phangorn etc and converting my data to Genind object and then calculating distance using provesti function and exporting this distance matrix to my folder. Now I am using this distance matrix as input file in PVCLUST and making tree with AU and BP values. is there any possibility to do above only in PVCLUST with my SNP genotyping data? if yes can any one help me to solve this issue? any help in this regard will be highly appreciated Thanks in Advance

R • 4.0k views
ADD COMMENT
0
Entering edit mode

Dear Kevin Good Morning Thanks lot for your help and time spent on solving my problem. I formatted my data as you suggested like 0,1,2 but for some reason it is not working as expected, ok leave it I will go through PVCLUST package only. I would like to know how I can use your above function to run in R on my data? please let me know on this usage Thanking you very much Regards

ADD REPLY
0
Entering edit mode

Hello, I have replied with an answer to your question above. It is better to add your comments to existing threads in order to help others who may encounter this page with the same problem! I am also going to bookmark my own answer, because i believe that it is very useful.

ADD REPLY
1
Entering edit mode

Dear Kevin thanks lot for your help and your code working perfectly Great help Regards

ADD REPLY
0
Entering edit mode

Great - glad to have helped!

ADD REPLY
4
Entering edit mode
7.2 years ago

Good afternoon, pvclust does not support genotyping data directly; however, you could possibly encode your genotypes as 0, 1, and 2 (for absent, heterozygous, and homozygous, respectively).

PLINK already has some functions for clustering genotypes, but I do not believe that it employs bootstrapped clustering like pvclust. See here: http://zzz.bwh.harvard.edu/plink/strat.shtml

Finally, you could feasibly edit the pvclust code and create your own function. Just type pvclust in R once you've loaded the package and you'll see the source code:

function (data, method.hclust = "average", method.dist = "correlation", 
    use.cor = "pairwise.complete.obs", nboot = 1000, parallel = FALSE, 
    r = seq(0.5, 1.4, by = 0.1), store = FALSE, weight = FALSE, 
    iseed = NULL, quiet = FALSE) 
...
        pvclust.nonparallel(data = data, method.hclust = method.hclust, 
            method.dist = method.dist, use.cor = use.cor, nboot = nboot, 
            r = r, store = store, weight = weight, iseed = iseed, 
            quiet = quiet)
    }
}
<environment: namespace:pvclust>
ADD COMMENT
1
Entering edit mode

Hey friend,

I have to thank you for posting this question, as it has been a very interesting problem. The problem with pvclust is that it does not allow you to supply a pre-existing distance matrix, so, I thought that one could perhaps edit the pvclust code in order to trick it into accepting a distance matrix. I tried that this morning but the function would require major overhaul.

I then thought about 'reverse engineering' the distance matrix in order to produce original data. Your original data is genotype data, but instead we could produce numerical data that approximately equaled your genotype data. I then produced the following code:

#Create a 20 x 20 sample distance matrix of random numbers
MyData <- matrix(rexp(400, rate=0.5), ncol=20)
MyDist <- dist(MyData, method="euclidean")

#'MyDist' can also be produced from other programs, like PLINK on genotype values

#'Reverse engineer' the distance matrix in order to recover approximate co-ordinates that could have produced it
iSampleNumber <- 20
x <- cmdscale(MyDist, k=iSampleNumber-1)

pdf("pvclustWithExistingDistMatrix.pdf", width=7, height=3)
    par(mfrow=c(1,3), cex=0.5)
    plot(hclust(MyDist, method="ward.D2"), main="Using supplied distance matrix")

    plot(hclust(dist(x, method="euclidean"), method="ward.D2"), main="Using recovered co-ordinates\nfrom supplied distance-matrix")

    require("pvclust")
    plot(pvclust(t(x), method.dist="euclidean", method.hclust="ward.D2"), main="pvclust of recovered co-ordinates\nfrom supplied distance-matrix")
dev.off()

In your case, you would start with your distance matrix MyDist. As you can see, this code (using cmdscale) has reverse-engineered the distance matrix in order to produce a matrix of numbers that could have approximately produced your distance matrix.

pvclust
image host

Source of inspiration: http://personality-project.org/r/mds.html

ADD REPLY
0
Entering edit mode

Hi Kevin.Thanks for your help. I have a distance matrix.Here is my data and codes and the steps and the errors that I got: 1. MyDist <- read.table("a_reynolds.txt", comment.char = "", fill=TRUE, quote = "\"", header=FALSE)

2.iSampleNumber <-8

3.x <- cmdscale(MyDist, k=iSampleNumber-1)

Error in FUN(left, right) : non-numeric argument to binary operator

Would you please help me with this issue? I think the problem is in the first step and I don't know what's the best way to read this matrix? Thanks in advance for your help

Distance matrix (8*8) in .txt file:

        1   2   3   4   5   6   7   8
1   0   0.111273    0.115951    0.226427    0.305437    0.16177 0.214013    0.185147
2   0.111273    0   0.052679    0.160293    0.236025    0.095651    0.141876    0.119369
3   0.115951    0.052679    0   0.154649    0.206997    0.085574    0.157362    0.097636
4   0.226427    0.160293    0.154649    0   0.351431    0.20936 0.264392    0.22942
5   0.305437    0.236025    0.206997    0.351431    0   0.275669    0.335607    0.277761
6   0.16177 0.095651    0.085574    0.20936 0.275669    0   0.184818    0.15174
7   0.214013    0.141876    0.157362    0.264392    0.335607    0.184818    0   0.220484
8   0.185147    0.119369    0.097636    0.22942 0.277761    0.15174 0.220484    0
ADD REPLY
0
Entering edit mode

Hello, it should work... Did you try:

x <- cmdscale(data.matrix(MyDist), k=iSampleNumber-1)

?

ADD REPLY
0
Entering edit mode
MyDist <- read.table("a_reynolds.txt", comment.char = "", fill=TRUE, quote = "\"", header=FALSE)
rownames(MyDist) <- MyDist[,1]
MyDist <- MyDist[,-1]
iSampleNumber <- 8
x <- cmdscale(data.matrix(MyDist), k=iSampleNumber-1)
x
      [,1]         [,2]         [,3]         [,4]         [,5]
 -0.042475025 -0.013555514  0.081499057  0.087691800  0.014799627
 -0.010187509 -0.003351743 -0.008136623  0.010063058  0.008007693
  0.012392542  0.011268210 -0.003846960 -0.002006144  0.005066133
 -0.097177520  0.136076103 -0.054625472  0.002217971  0.006467648
  0.231370615  0.014889817 -0.027563950  0.013768633  0.003392256
 -0.020837731 -0.017238490  0.018543249 -0.018309687 -0.088069393
 -0.068869537 -0.125742267 -0.075126891 -0.001648332  0.017507204
 -0.004215834 -0.002346116  0.069257590 -0.091777300  0.032828833
     [,6]
  0.0026360333
 -0.0258060091
  0.0205055836
  0.0009395711
 -0.0004270558
 -0.0003186091
  0.0032660195
 -0.0007955333
ADD REPLY
0
Entering edit mode

Thanks Kevin. It didnt work.Here is the error: x <- cmdscale(data.matrix(MyDist), k=iSampleNumber-1) Warning message: In cmdscale(data.matrix(MyDist), k = iSampleNumber - 1) : only 6 of the first 7 eigenvalues are > 0

ADD REPLY
1
Entering edit mode

Yes, you could try k = iSampleNumber - 2

ADD REPLY
0
Entering edit mode

Worked perfectly.Great help! The only issue is that the location of each group in branches is different from what I got before in NJ tree (without Bootstraping value).Any idea? I got the plot but I found this warning message, too: Warning message: In a$p[] <- c(1, bp[r == 1]) : number of items to replace is not a multiple of replacement length

ADD REPLY
0
Entering edit mode

It make sense to have different results after bootstrapping. If your dataset is very 'non-uniform' (heteroskedastic), then you would expect very different results.

Regarding the other warning, it may be harmless or it may not be. From which function, exactly, does it arise? If you try a value of k that divides evenly into your dataset, like k=2 in a dataset of 8 samples), does it give that same warning?

ADD REPLY
0
Entering edit mode

The dataset is nearly uniform! I have 8 populations and in x I have [,1] to [,6] columns. Is it fine or I should have [,7] column, too? I got this error after the plot generated: .. Bootstrap (r = 1.38)... Done. Warning message: In a$p[] <- c(1, bp[r == 1]) : number of items to replace is not a multiple of replacement length And yes, the same error with k=2.

ADD REPLY

Login before adding your answer.

Traffic: 1692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6