I am trying to extract the LRR
and BAF
values from an affymetrix SNP chip without success using linux based tools. I tried to use a small subset in windows designed software called "Axiomâ„¢ CNV Summary Tools Software" and it works perfectly. The problem is that I have a huge dataset and would be impossible to run in windows machine powerful enough.
Let's expose my steps until this point. First, I obtained five tab delimited files which are require to linux and/or windows pipeline (1-3 obtained with APT affymetrix software).
1 - The Axiom calls.txt
or genotype file:
calls <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 2 2 2
AX-100010999 1 0 1
AX-100011005 0 1 2
AX-100011007 2 2 1
AX-100011008 1 1 2
AX-100011010 2 2 2
AX-100011011 0 1 0
AX-100011012 0 1 0
AX-100011016 0 0 1
AX-100011017 0 0 2'
calls <- read.table(text=calls, header=T)
2 - The confidences.txt
file:
conf<- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 0.00001 0.0002 0.00006
AX-100010999 0.00001 0.00001 0.00001
AX-100011005 0.00007 0.00017 0.00052
AX-100011007 0.00001 0.00001 0.00001
AX-100011008 0.001 0.00152 0.00001
AX-100011010 0.00001 0.00001 0.00002
AX-100011011 0.00004 0.00307 0.00002
AX-100011012 0.00001 0.00001 0.00001
AX-100011016 0.00003 0.00001 0.00001
AX-100011017 0.00003 0.01938 0.00032'
conf <- read.table(text=conf, header=T)
3 - The summary.txt
file:
summ <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 740.33229 655.41465 811.98053
AX-100010998-B 1139.25679 1659.55079 917.7128
AX-100010999-A 1285.67306 1739.03296 1083.48455
AX-100010999-B 1403.51265 341.85893 1237.48577
AX-100011005-A 1650.03408 1274.57594 485.5324
AX-100011005-B 430.3122 2674.70182 4070.90727
AX-100011007-A 411.28952 449.76345 2060.7136
AX-100011007-B 4506.77692 4107.12982 2065.58516
AX-100011008-A 427.78263 439.63541 333.86312
AX-100011008-B 1033.41335 1075.31617 1623.69271
AX-100011010-A 390.12996 350.54456 356.63156
AX-100011010-B 1183.29912 1256.01391 1650.82396
AX-100011011-A 3593.93578 2902.34079 2776.2503
AX-100011011-B 867.33447 2252.54552 961.31596
AX-100011012-A 2250.44699 1192.46116 1927.70581
AX-100011012-B 740.31957 1721.70283 662.1414
AX-100011016-A 1287.9221 1367.95468 1037.98191
AX-100011016-B 554.8795 666.93132 1487.2143
AX-100011017-A 2002.40468 1787.42982 490.28802
AX-100011017-B 849.92775 1025.44417 1429.96567'
summ <- read.table(text=summ, header=T)
4 - The gender.txt
:
gender <- 'cel_files gender
sample_1 female
sample_2 female
sample_3 female'
gender <- read.table(text=gender, header=F)
And finally the map file map.db
in windows (non readable) or map.txt
in linux as follows:
map <- 'Name Chr Position
AX-100010998 Z 70667736
AX-100010999 4 36427048
AX-100011005 26 4016045
AX-100011007 6 25439800
AX-100011008 2 147800617
AX-100011010 1 98919397
AX-100011011 Z 66652642
AX-100011012 7 28180218
AX-100011016 1A 33254907
AX-100011017 5 1918020'
map <- read.table(text=map, header=T)
This is my result in windows based result for sample_1
:
Name Chr Position sample_1.GType sample_1.Log R Ratio sample_1.B Allele Freq
AX-100010998 Z 70667736 BB Infinity 0.675637419295063
AX-100010999 4 36427048 AB 0.101639462657534 0.531373516807123
AX-100011005 26 4016045 AA -0.111910305454305 0
AX-100011007 6 25439800 BB 0.148781943283483 1
AX-100011008 2 147800617 AB -0.293273363654622 0.609503132331127
AX-100011010 1 98919397 BB -0.283993308525307 0.960031843823016
AX-100011011 Z 66652642 AA Infinity 0.00579049667757003
AX-100011012 7 28180218 AA 0.0245684274744242 0.032174599843476
AX-100011016 1A 33254907 AA -0.265925457515035 0
AX-100011017 5 1918020 AA -0.0091211520536838 0
The values from the windows based tool seems to be correct, but in linux output that's is not the case. I am following the steps described at penncnv software (http://penncnv.openbioinformatics.org/en/latest/user-guide/input/) and I log2 transformed my summary.txt
and did the quantile normalization with limma
package using normalizeBetweenArrays(x)
, finishing with the corrsummary.txt
:
corrsum <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 9.804932 9.285738 9.530882
AX-100010998-B 10.249239 10.528922 9.804932
AX-100010999-A 10.528922 10.641862 10.134816
AX-100010999-B 10.641862 8.472829 10.249239
AX-100011005-A 10.804446 10.249239 8.816931
AX-100011005-B 8.835381 11.186266 12.045852
AX-100011007-A 8.542343 8.835381 11.039756
AX-100011007-B 12.045852 12.045852 11.186266
AX-100011008-A 8.816931 8.816931 8.472829
AX-100011008-B 10.134816 9.910173 10.592867
AX-100011010-A 8.472829 8.542343 8.542343
AX-100011010-B 10.374032 10.134816 10.641862
AX-100011011-A 11.593784 11.593784 11.593784
AX-100011011-B 10.012055 11.039756 9.910173
AX-100011012-A 11.186266 10.012055 10.804446
AX-100011012-B 9.530882 10.592867 9.285738
AX-100011016-A 10.592867 10.374032 10.012055
AX-100011016-B 9.285738 9.530882 10.528922
AX-100011017-A 11.039756 10.804446 8.835381
AX-100011017-B 9.910173 9.804932 10.374032'
corrsum <- read.table(text=corrsum, header=T)
Thus I applied:
./generate_affy_geno_cluster.pl calls.txt confidences.txt corrsummary.txt --locfile map.txt --sexfile gender.txt --output gencluster
and
./normalize_affy_geno_cluster.pl --locfile map.txt gencluster calls.txt --output lrrbaf.txt
And my linux based result (lrrbaf.txt
) which must contain LRR
and BAF
information looks like that:
output <- 'Name Chr Position sample_1.LogRRatio sample_1.BAlleleFreq sample_2.LogRRatio sample_2.BAlleleFreq sample_3.LogRRatio sample_3.BAlleleFreq
AX-100010999 4 36427048 -1952.0739 2 -1953.0739 2 -1952.0739 2
AX-100011005 26 4016045 -2245.1784 2 -2244.1784 2 -2243.1784 2
AX-100011007 6 25439800 -4433.4661 2 -4433.4661 2 -4434.4661 2
AX-100011008 2 147800617 -1493.2287 2 -1493.2287 2 -1492.2287 2
AX-100011011 Z 66652642 -4088.2311 2 -4087.2311 2 -4088.2311 2
AX-100011012 7 28180218 -2741.2623 2 -2740.2623 2 -2741.2623 2
AX-100011016 1A 33254907 -2117.7005 2 -2117.7005 2 -2116.7005 2
AX-100011017 5 1918020 -3067.4077 2 -3067.4077 2 -3065.4077 2'
output <- read.table(text=output, header=T)
As showed above the linux result is completely different from windows based results (and make much less sense) and additionally do not contain the GType
column in the output
. Sorry to compose such a long question, but my intention was to make it as reproducible as possible. I would be grateful for any light to solve this problem as well any important remarks about this kind of data that I maybe forgot.