Dear Biostars,
I am trying to prepare some published data to test a CNV filtration method I am working on. I would really like to use the data from Conrad et al (2007), mostly because it is highly cited and easy to access. The link is here: https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-142?query=E-MTAB-142.
Unfortunately, Agilent array technology is a bit before my time and I am struggling to figure out how to read the signal information. Here is a snippet of their data, which should load easily into R as a data frame.
x <- structure(list(FEATURES = c("DATA", "DATA", "DATA"), FeatureNum = 6:8,
Row = c(1L, 1L, 1L), Col = 6:8, SubTypeMask = c(0L, 0L, 0L
), ControlType = c(0L, 0L, 0L), ProbeName = c("A_18_P17027306",
"chr1_165793426_165793473", "A_18_P14570373"), SystematicName = c("chr9:137150180-137150224",
"chr1:165793427-165793473", "chr3:198891339-198891384"),
LogRatio = c(0.08975880656, 0.1139920636, 0.1038222868),
LogRatioError = c(0.0619727653, 0.0625525488, 0.06214855983
), PValueLogRatio = c(0.1475167061, 0.06840327396, 0.09481057346
), gProcessedSignal = c(2550.198, 479.9035, 4755.878), rProcessedSignal = c(3135.688,
623.9445, 6040.224), gProcessedSigError = c(255.087, 48.34247,
475.6231), rProcessedSigError = c(313.597, 62.53483, 604.0367
), gMedianSignal = c(807.5, 188, 1464.5), rMedianSignal = c(1826,
405.5, 3546), gBGMedianSignal = c(38, 38, 38), rBGMedianSignal = c(43,
44, 44), gBGPixSDev = c(7.409993, 7.485912, 7.367351), rBGPixSDev = c(9.274448,
9.2318, 9.213135), gIsSaturated = c(0L, 0L, 0L), rIsSaturated = c(0L,
0L, 0L), gIsFeatNonUnifOL = c(0L, 0L, 0L), rIsFeatNonUnifOL = c(0L,
0L, 0L), gIsBGNonUnifOL = c(0L, 0L, 0L), rIsBGNonUnifOL = c(0L,
0L, 0L), gIsFeatPopnOL = c(0L, 0L, 0L), rIsFeatPopnOL = c(0L,
0L, 0L), gIsBGPopnOL = c(0L, 0L, 0L), rIsBGPopnOL = c(0L,
0L, 0L), IsManualFlag = c(0L, 0L, 0L), gBGSubSignal = c(772.789,
146.17, 1455.8), rBGSubSignal = c(1831.5, 366.544, 3568.53
), gIsPosAndSignif = c(1L, 1L, 1L), rIsPosAndSignif = c(1L,
1L, 1L), gIsWellAboveBG = c(1L, 1L, 1L), rIsWellAboveBG = c(1L,
1L, 1L), SpotExtentX = c(49.8279, 47.5395, 47.8731), gBGMeanSignal = c(37.7605,
37.804, 37.9131), rBGMeanSignal = c(43.431, 44.9921, 44.1818
)), row.names = 6:8, class = "data.frame")
I am hoping to wrangle this data into something like a standard BED-style format for CNVs with the following columns: Chromosome, Start, End, Type, Value.
The first three columns can be extracted from column 8 (SystematicName), but I am struggling to work out how to determine the Type (Deletion or Duplication) or the Value (0, 1, 2, 3, 4, >4), as you would expect from modern CNV callers for WES/WGS.
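For the coordinates, this is roughly what I have in mind. It is only a minimal sketch, assuming every probe's SystematicName has the form chr:start-end; anything else (for example control probes) would come out as NA. The helper name parse_systematic_name is my own, and I have not checked whether the Agilent coordinates are 0- or 1-based, so Start is left exactly as given.

```r
# Hypothetical helper: split "chr9:137150180-137150224" into BED-like columns.
# Assumes the form chr<name>:<start>-<end>; non-matching names become NA.
parse_systematic_name <- function(systematic_name) {
  pattern <- "^(chr[^:]+):([0-9]+)-([0-9]+)$"
  ok <- grepl(pattern, systematic_name)
  n <- length(systematic_name)

  chrom <- rep(NA_character_, n)
  start <- rep(NA_integer_, n)
  end <- rep(NA_integer_, n)

  chrom[ok] <- sub(pattern, "\\1", systematic_name[ok])
  start[ok] <- as.integer(sub(pattern, "\\2", systematic_name[ok]))
  end[ok] <- as.integer(sub(pattern, "\\3", systematic_name[ok]))

  data.frame(Chromosome = chrom, Start = start, End = end,
             stringsAsFactors = FALSE)
}

# Example with the three SystematicName values from the snippet above.
coords <- parse_systematic_name(c("chr9:137150180-137150224",
                                  "chr1:165793427-165793473",
                                  "chr3:198891339-198891384"))
print(coords)
```

With the full data frame this would be called as parse_systematic_name(x$SystematicName) and bound to the other columns with cbind().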
I assume the final few columns, e.g. gBGMeanSignal and rBGMeanSignal, might be valuable here as they seem to show normalised abundance values, but I am unsure whether to average them or add them together.
Any guidance would be most welcome. I also see there is a p-value column (PValueLogRatio). I assume it can be used to filter out low-confidence values?
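To make the question concrete, here is a sketch of the kind of per-probe filter I am imagining. Checking the snippet, LogRatio matches log10(rProcessedSignal / gProcessedSignal) (for the first probe, log10(3135.688 / 2550.198) is about 0.0898), so I am treating it as the red/green ratio. Which channel is the test sample is an assumption on my part, and the function name label_probes and the cutoffs p_cutoff and ratio_cutoff are placeholders, not values from the paper. A single probe is not a CNV call, so this only labels probes as Gain, Loss, or Neutral.

```r
# Hypothetical per-probe labelling from LogRatio and PValueLogRatio.
# Assumptions: LogRatio is log10(red / green) and red is the test sample;
# both cutoffs are arbitrary placeholders chosen for illustration only.
label_probes <- function(log_ratio, p_value,
                         p_cutoff = 0.05, ratio_cutoff = 0.1) {
  type <- rep("Neutral", length(log_ratio))
  confident <- !is.na(p_value) & p_value < p_cutoff
  # which() drops NA comparisons, so probes with a missing ratio stay Neutral.
  type[which(confident & log_ratio >= ratio_cutoff)] <- "Gain"
  type[which(confident & log_ratio <= -ratio_cutoff)] <- "Loss"
  type
}

# Made-up values, just to show the behaviour.
example_types <- label_probes(log_ratio = c(0.30, -0.40, 0.02, 0.50),
                              p_value   = c(0.001, 0.010, 0.001, 0.200))
print(example_types)
# "Gain" "Loss" "Neutral" "Neutral"
```

With these placeholder cutoffs, all three probes in my snippet would come out as Neutral, since their p-values are all above 0.05.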
Many thanks, Krutik