I am trying to calculate the frequency of every pathogenic germline variant in every disease cohort. For example, for variant 1:17588689 there are 20 heterozygous calls (column H), and I need to report what percentage of those samples are in the glioma cohort, what percentage are in the meningioma cohort, and so on. This information should be appended as new columns after the last column, one per disease cohort (meningioma, glioma, schwannoma, pituitary adenoma, others), and the values for each variant should sum to 1 (i.e. 100%). In my dataset the heterozygous sample IDs are listed in column AZ, titled "HetSamples", and are separated by commas. I am stuck partway through, so I would appreciate any help completing it. My code so far:
library(stringr)
library(tidyr)
library(dplyr)
diseaseData <- read.delim(".../.txt", header = TRUE, sep = "\t") # disease cohort information
variantData <- read.delim(".../.txt", header = TRUE, sep = "\t") # variant table with the HetSamples column
# Split the comma-separated IDs so that each row holds one variant-sample pair
variantData <- variantData %>%
  mutate(HetSamples = strsplit(as.character(HetSamples), ",")) %>%
  unnest(HetSamples)
# Separate the sample IDs that contain 'U' (New) from those that do not (Old)
variantDataOld <- variantData %>%
  filter(!str_detect(HetSamples, 'U'))
variantDataNew <- variantData %>%
  filter(str_detect(HetSamples, 'U'))
diseaseDataOld <- diseaseData %>%
  filter(!str_detect(ClinicalSeqID, 'U'))
diseaseDataNew <- diseaseData %>%
  filter(str_detect(ClinicalSeqID, 'U'))
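# Split the old-style IDs on '-' to inspect their parts (these two results are only printed, not stored)
data.frame(do.call("rbind", strsplit(as.character(variantDataOld$HetSamples), "-", fixed = TRUE)))
data.frame(do.call("rbind", strsplit(as.character(diseaseDataOld$ClinicalSeqID), "-", fixed = TRUE)))
# Store up to three '-'-separated parts of each ID as new columns
variantDataOld[c('Col1', 'Col2', 'Col3')] <- str_split_fixed(variantDataOld$HetSamples, '-', 3)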
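To make the target output concrete, here is a minimal sketch of the calculation on made-up toy data. It assumes the sample IDs in HetSamples already match ClinicalSeqID exactly (so it skips the old/new ID handling above), and the column names Variant and Disease are hypothetical placeholders, since I have not shown the real ones. It uses base R only so it runs without extra packages.

```r
# Toy stand-ins for the real tables; every value here is made up for illustration.
variantData <- data.frame(
  Variant    = c("1:17588689", "2:12345"),   # hypothetical variant ID column
  HetSamples = c("S1,S2,S3,S4", "S2,S5"),
  stringsAsFactors = FALSE
)
diseaseData <- data.frame(
  ClinicalSeqID = c("S1", "S2", "S3", "S4", "S5"),
  Disease       = c("glioma", "glioma", "meningioma", "schwannoma", "others"),  # hypothetical column
  stringsAsFactors = FALSE
)
cohorts <- c("meningioma", "glioma", "schwannoma", "pituitary adenoma", "others")

# One row per (variant, heterozygous sample) pair.
ids <- strsplit(variantData$HetSamples, ",", fixed = TRUE)
long <- data.frame(
  Variant       = rep(variantData$Variant, lengths(ids)),
  ClinicalSeqID = trimws(unlist(ids)),
  stringsAsFactors = FALSE
)

# Attach each sample's cohort; samples missing from diseaseData get NA.
long <- merge(long, diseaseData, by = "ClinicalSeqID", all.x = TRUE)

# Fixed factor levels keep the row order of variantData and keep empty cohorts as 0.
# This step assumes each variant appears only once in variantData.
long$Variant <- factor(long$Variant, levels = variantData$Variant)
long$Disease <- factor(long$Disease, levels = cohorts)

# Count samples per variant and cohort, then divide each row by its total.
# NA cohorts are dropped by table(), so fractions are over matched samples only;
# a variant with no matched samples would give NaN.
counts <- table(long$Variant, long$Disease)
cohortFraction <- prop.table(counts, margin = 1)

# Append one fraction column per cohort to the original variant table.
fractionCols <- as.data.frame.matrix(cohortFraction)
names(fractionCols) <- paste0("frac_", gsub(" ", "_", cohorts))
result <- cbind(variantData, fractionCols)
rownames(result) <- NULL
print(result)
```

Each row of the frac_ columns sums to 1 as long as at least one of the variant's samples is found in diseaseData. Multiplying by 100 would give percentages instead of fractions.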
Cross-posted on Stack Overflow: https://stackoverflow.com/questions/72597194