Question

Calculating the frequency of every pathogenic germline variant in every disease cohort?

2.5 years ago

Hasan • 0

I am trying to calculate the frequency of every pathogenic germline variant in every disease cohort. For example, variant 1:17588689 has 20 heterozygous calls (column H), and I need to report what percentage of these samples are in the glioma cohort, what percentage are in the meningioma cohort, and so on. That information should be appended after the last column, one value per disease cohort (meningioma, glioma, schwannoma, pituitary adenoma, others), and for each variant the values must add up to 1 when expressed as fractions. The heterozygous sample IDs are listed in column AZ, titled "HetSamples", and are separated by commas. I am stuck partway through, so I would appreciate it if someone could help me complete it.

library(stringr)
library(tidyr)
library(dplyr)

diseaseData <- read.delim(".../.txt", header = TRUE, sep = "\t") # disease cohort information
variantData <- read.delim(".../.txt", header = TRUE, sep = "\t")

# One row per (variant, heterozygous sample ID)
variantData <- variantData %>%
  mutate(HetSamples = strsplit(as.character(HetSamples), ",")) %>%
  unnest(HetSamples)

# Split both tables by whether the sample ID contains "U"
variantDataOld <- variantData %>%
  filter(!str_detect(HetSamples, 'U'))
variantDataNew <- variantData %>%
  filter(str_detect(HetSamples, 'U'))
diseaseDataOld <- diseaseData %>%
  filter(!str_detect(ClinicalSeqID, 'U'))
diseaseDataNew <- diseaseData %>%
  filter(str_detect(ClinicalSeqID, 'U'))

# Inspect the "-"-separated parts of the IDs (printed only, not stored)
data.frame(do.call("rbind", strsplit(as.character(variantDataOld$HetSamples), "-", fixed = TRUE)))
data.frame(do.call("rbind", strsplit(as.character(diseaseDataOld$ClinicalSeqID), "-", fixed = TRUE)))

# Store the first three "-"-separated parts of each ID as new columns
variantDataOld[c('Col1', 'Col2', 'Col3')] <- str_split_fixed(variantDataOld$HetSamples, '-', 3)
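
For reference, the remaining step (joining each heterozygous sample to its cohort and turning counts into per-variant fractions) could be sketched roughly as below. This is only a sketch on toy data: the `Variant` and `Disease` column names, the sample IDs, and the second variant are made up for illustration, and it assumes the IDs in `HetSamples` already match `ClinicalSeqID` exactly, so any ID clean-up would have to happen before the join.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the real tables. Only `HetSamples` and `ClinicalSeqID`
# come from the post; `Variant`, `Disease` and all values are hypothetical.
variantData <- data.frame(
  Variant    = c("1:17588689", "toy:2"),
  HetSamples = c("S1,S2,S3,S4", "S2,S5"),
  stringsAsFactors = FALSE
)
diseaseData <- data.frame(
  ClinicalSeqID = c("S1", "S2", "S3", "S4", "S5"),
  Disease       = c("glioma", "glioma", "meningioma", "schwannoma", "others"),
  stringsAsFactors = FALSE
)

cohortFractions <- variantData %>%
  # One row per (variant, heterozygous sample)
  separate_rows(HetSamples, sep = ",") %>%
  mutate(HetSamples = trimws(HetSamples)) %>%
  # Attach each sample's cohort
  left_join(diseaseData, by = c("HetSamples" = "ClinicalSeqID")) %>%
  # Keep samples without a cohort as their own group so fractions still sum to 1
  mutate(Disease = ifelse(is.na(Disease), "unmatched", Disease)) %>%
  # Number of heterozygous samples per variant and cohort
  count(Variant, Disease, name = "n") %>%
  group_by(Variant) %>%
  mutate(fraction = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per cohort; cohorts with no carriers get 0
  pivot_wider(names_from = Disease, values_from = fraction,
              values_fill = 0, names_prefix = "frac_")

# Append the cohort columns to the original one-row-per-variant table
result <- left_join(variantData, cohortFractions, by = "Variant")
print(result)
```

The join is done on the comma-separated table but the result is merged back onto the original `variantData`, so the table keeps one row per variant with the cohort columns at the end. Multiplying the `frac_` columns by 100 would give percentages instead of fractions.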

r bioconductor biostatistics • 507 views

updated 2.0 years ago by Ram 44k • written 2.5 years ago by Hasan • 0

Cross-posted: https://stackoverflow.com/questions/72597194

2.5 years ago by Pierre Lindenbaum 164k