Hi, I used a filtered SNP vcf to run plink using the --recodeAD, --doubleid (my sample names have underscores) and --allow-extra-chr (the name of my chr was not being recognized). My plink.raw file has not only 0, 1, 2 but also many NAs in the samples, distributed randomly. I referred to the manual too, but cant figure out what it implies. It looks something like this:
IND1 0 0 1 0 0 0
IND2 0 0 NA NA 0 0
IND3 0 1 1 0 0 0
This is the relevant part in the manual:
**Using the --recodeAD option generates the file plink-recode.raw:
FID IID PAT MAT SEX PHENOTYPE snp1_2 snp1_HET snp2_G snp2_HET
1 1 0 0 1 1 0 0 2 0
1 2 0 0 2 1 NA NA 1 1
1 3 0 0 1 1 0 0 1 1
1 4 0 0 2 1 1 1 0 0
The column labels reflect the snp name (e.g. snp1) with the name of the minor allele appended (i.e. snp1_2 in the first instance, as 2 is the minor allele) for the additive component. The dominant component ( a dummy variable reflecting heterozygote state) is coded with the _HET suffix.
This file can be easily loaded into R: for example:
d <- read.table("plink.raw",header=T)
For example, for the first SNP, the individuals are coded 1/1, 0/0, 1/1 and 2/1. The additive count of the number of common (1) alleles is therefore: 2, NA, 2 and 1, which is reflected in the field snp1_2. The field snp1_HET is coded 1 for the fourth individual who is heterozygous -- this field can be used to model dominance effect of the allele.
The behavior of the --recodeA and --recodeAD commands can be changed with the --recode-allele command. This allows for the 0, 1, 2 count to reflect the number of a pre-specified allele type per SNP, rather than the number of the minor allele. This command takes as a single argument the name of a file that lists SNP name and allele to report, e.g. if the file recode.txt contained
snp1 1
snp2 A
then
plink --file data --recodeAD --recode-allele recode.txt
would now report in the LOG file
Reading allele coding list from [ recode.txt ]
Read allele codes for 2 SNPs
and the plink.raw file would read
FID IID PAT MAT SEX PHENOTYPE snp1_1 snp1_HET snp2_A snp2_HET
1 1 0 0 1 1 2 0 0 0
1 2 0 0 2 1 NA NA 1 1
1 3 0 0 1 1 2 0 1 1
1 4 0 0 2 1 1 1 2 0
If the SNP is monomorphic, by default the allele code out will be 0 and all individuals will have a count of 0 (or NA). If an allele is specified in --recode-allele that is not seen in the data, similarly all individuals will receive a 0 count (i.e. rather than an error being given).**
The ultimate aim is to generate a genotype heat map in R for the genotype (Het and hom) at each SNP position for all the individuals. I'm completely new to analysis. It would be great if I can get some help on this asap- whether it is correct and how it can be modified. and what the NAs mean. Thanks a lot.
https://en.wikipedia.org/wiki/N/a