Apply a function over multiple SNPs files
0
0
Entering edit mode
10.1 years ago
l0ka ▴ 10

I have a data frame with 2 columns, with the names of the normal samples and the corresponding names of the tumor samples of 111 patients.

# Data.frame with names of samples
normal <- c("00-000450N", "01-28TZ", "02-1562L", "02-2035R", "03-1334B")
tumor <- c("00-000450T", "01-28R", "02-1562TZ", "02-2035L", "03-1334T")
SIF <- data.frame(normal, tumor)

Then, I have 2 files with the datas about some segments, 1 for the tumor regions and one for the normal ones.

# Regions datas
sample <- c(rep("00-000450N", 5), rep("01-28TZ", 5), rep("02-1562L", 5))
chr <- c(rep(c(1:5), 3))
start <- c("62908", "8165719", "10580504", "10720851", "17329625",
           "62908", "6140490", "10721354", "25663298", "29856844",
           "23820990", "27653252", "28694395", "28885620", "29271409")
end <- c("8165619", "10573304", "10720717", "16830571", "17921611",
         "6140140", "10708606", "25595935", "29854740", "30376692",
         "27652503", "28686972", "28885509", "29269991", "36702302")
TumorDat <- data.frame(sample, chr, start, end)

Then I have 222 files that contain the datas about SNPs positions. 111 of them are from normal samples and the other 111 are from the tumor ones (for example they are labelled like "01-28TZ.PILEUP" and "01-28R.PILEUP"). The structure is like that:

# SNPs datas
chr <- c(1:5)
pos <- c("909419", "935222", "957898", "1120307", "1147297")
dbsnp <- c("rs28548431", "rs2298214", "rs2799064", "rs7539911", "rs2298212")
snpDat <- data.frame(chr, pos, dbsnp)

These files contains also datas about what is the reference/alternative base of the considered SNP and the number of reads that map on the ref/alt base.

I also have this function that compute the allelic fraction. It looks for the SNPs positions inside each segment region.

# Fun calc_AF
calc_AF <- function(regions, snps) {
  regions$AF <- NA
  for(i in 1:nrow(regions)){
    seg <- regions[i, ]
    idx <- snps[which(seg$chr %in% snps$chr &
                        snps$pos >= seg$start &
                        snps$pos <= seg$end), ]

    list <- getRefAndAltPerc(idx)

    AF <- mean( list$ref / (list$alt + list$ref) )
    regions$AF[i] <- AF    
  }
  return(regions)
}

My problem is how to tell R that it must computes the allelic fractions (AF) by sample. Each file is labelled with the name of the sample.

For example, if the file with SNPs datas is named "00-000450T.PILEUP", R should computes the AF only for the sample "00-000450N" in "TumorDat" file.

I have faint this function, but I'm not able to complete it.

Tlist_files = list.files("/path-to-dir-with-tumor-files/", full.names=T)
for (i in Tlist_files) {
  Tfile <- read.table(i, stringsAsFactors=F, header=T)   
  Tname <- gsub(".PILEUP.PaPI", "", basename(i))
  Nname <-
  Nfile <- read.delim(paste0("/path-to-dir-with-normal-files/", Nname, ".PILEUP"), header=T)

  # calc_AF normal
  NDat <- calc_AF(regionDatN, Nfile)

  # calc_AF tumor
  TDat <- calc_AF(regionDatT, Tfile)

}
SNP genome R • 2.0k views
ADD COMMENT
0
Entering edit mode

Yikes, is there a reason you didn't just do multi-sample variant calling? You'd then have a conveniently formatted VCF file that's rather simple to deal with in R (or any of a number of command line tools).

ADD REPLY

Login before adding your answer.

Traffic: 1169 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6