I am writing code to count alleles from 23andMe genome text files. The code returns a factor whose levels are the allele symbols. I want to assign a number to each genotype so that each copy of the effect allele is scored as 1 and the other allele as 0. In this case AA=2, AG=1, GG=0. If I use the as.integer function, it instead returns the position of the value among the factor levels (see the bottom of the output), which is not what I want.
The alleles column (V4) has 19 different levels (corresponding to all the alleles present in the genome), but I am interested in only 4 of them for each SNP. How do I assign a numeric value to each of the four genotypes?
setwd("~/genomes")
mydata=read.table("genome_003.txt")
View(mydata)
library(Hmisc)
df=as.data.frame(mydata)
rownumber=match('rs9375195', df$V1) #returns the first location of SNP (rs IDs are in column V1)
df[rownumber,] #displays row corresponding to SNP
              V1 V2       V3 V4
224186 rs9375195  6 98562720 AA
genotype=df[rownumber,]$V4
genotype #displays alleles for corresponding SNP
[1] AA #genotype
Levels: -- A AA AC AG AT C CC CG CT DD DI G GG GT I II T TT
> number=as.integer(genotype)
> number
[1] 3
So what you want is for genotype=df[rownumber,]$V4 to return 2 instead of AA?
Exactly so!
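One way to get that score, as a minimal sketch rather than a definitive solution: convert the factor to character first, so you work with the labels instead of the level positions, then count how many characters of each genotype equal the effect allele. The helper name count_effect_allele and its effect_allele argument are made up for this illustration, and treating "--" as a no-call that should become NA is an assumption about how you want missing genotypes handled.

```r
# Hypothetical helper: number of copies of the effect allele in each genotype.
count_effect_allele <- function(genotype, effect_allele) {
  g <- as.character(genotype)   # use the labels, not the factor level codes
  chars <- strsplit(g, "")      # "AG" becomes c("A", "G")
  counts <- vapply(chars, function(x) sum(x == effect_allele), integer(1))
  # Assumption: "--" means no call, so score it as missing rather than 0.
  counts[grepl("-", g, fixed = TRUE)] <- NA_integer_
  counts
}

# The 19 levels shown in the question's output.
lv <- c("--", "A", "AA", "AC", "AG", "AT", "C", "CC", "CG", "CT",
        "DD", "DI", "G", "GG", "GT", "I", "II", "T", "TT")
genotype <- factor(c("AA", "AG", "GG", "--"), levels = lv)

as.integer(genotype)                # level positions: 3 5 14 1
count_effect_allele(genotype, "A")  # effect-allele counts: 2 1 0 NA
```

With the data frame from the question this would be called as count_effect_allele(df[rownumber,]$V4, "A"). Note that a single-letter genotype such as "A" (one copy, for example on a male X chromosome) scores 1 under this approach, which may or may not be what you want.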