I am trying to calculate the percentage of heterozygous sites for each sample. In my data frame, each column is a sample and each row is a SNP site. I wrote an R script using stringr that counts the number of heterozygous sites and divides it by the number of known sites (the number of all sites minus the missing sites, coded "N"). Here is my function:
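library(stringr)

PercentPolymorphism <- function(df) {
  # Pre-allocate one slot per sample (column)
  n <- ncol(df)
  countN <- numeric(n)
  knownSitesN <- numeric(n)
  heteroSitesN <- numeric(n)
  PercPolym <- numeric(n)
  for (i in seq_len(n)) {
    countN[i] <- sum(str_count(df[, i], "N"))   # missing sites
    knownSitesN[i] <- nrow(df) - countN[i]
    # A character class matches any heterozygous code at every site;
    # a pattern vector c("R", "Y", ...) would be recycled along the sites.
    heteroSitesN[i] <- sum(str_count(df[, i], "[RYMKSW]"))
    PercPolym[i] <- heteroSitesN[i] / knownSitesN[i]
  }
  df.new <- as.data.frame(rbind(knownSitesN, heteroSitesN, PercPolym, df))
  return(df.new)
}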
I have around 5000 samples and only about 1 Mb of SNP data. The function runs on the data without throwing any errors, but it takes a very long time to complete.
Could anyone suggest how to modify my function to improve its computational efficiency?
Thanks in advance!
Li
Please post example data.
An example is as follows:
This contains six sample columns, not five.
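Use a character matrix instead of a data frame. Comparisons on a matrix are vectorised over all cells at once, so the per-column loop and the repeated str_count calls are no longer needed.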
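A minimal sketch of the matrix approach, under a few assumptions that are not in the original post: every cell holds exactly one IUPAC code character, missing calls are coded "N", and there are no NA values. The toy data frame `geno` and the function name `percent_heterozygous` are made up for illustration. Unlike the original function, it returns one summary row per sample rather than binding the statistics on top of the genotypes.

```r
# Hypothetical toy data: rows are SNP sites, columns are samples.
geno <- data.frame(
  s1 = c("A", "R", "N", "G"),
  s2 = c("Y", "N", "N", "T"),
  s3 = c("C", "C", "W", "K"),
  stringsAsFactors = FALSE
)

percent_heterozygous <- function(df) {
  m <- as.matrix(df)                          # character matrix, sites x samples
  het_codes <- c("R", "Y", "M", "K", "S", "W")
  known <- colSums(m != "N")                  # non-missing sites per sample
  # %in% drops the dimensions, so restore them before summing by column
  hetero <- colSums(array(m %in% het_codes, dim = dim(m)))
  data.frame(
    sample = colnames(m),
    knownSitesN = known,
    heteroSitesN = hetero,
    PercPolym = hetero / known,
    row.names = NULL
  )
}

res <- percent_heterozygous(geno)
print(res)
```

If some cells can hold more than one character, the single-character comparison above no longer applies and a pattern-based count would be needed instead.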