I wonder whether there is an easy way to preserve haplotypes when performing some basic operations with plink. Is there an option for this? I couldn't find one on my own.
The situation is as follows: we have phased data in .ped and .map format. We want to apply some filters (e.g. keep SNPs with minor allele frequency above 5% in our set of individuals, keep a subset of individuals, etc.). But we found that plink does not keep the phase in this case. Everything is mixed up in the output files.
When you recode your ped, plink sets the minor allele as A1. plink does not guarantee that it will keep your phase, but you can probably keep your alleles in the order you entered them (and so keep the phase) if you use --keep-allele-order.
plink2 solved the problems you mentioned here: link
plink2 --vcf chr1.vcf --make-pgen --out chr1
The --pfile flag usually causes the binary fileset prefix.pgen +
prefix.pvar + prefix.psam to be referenced, while --pgen/--pvar/--psam
let you fully name one file at a time. New features supported by these
formats include:
Reliable tracking of REF vs. ALT alleles. Computationally efficient
compression of low-MAF and high-LD variants. Phased genotypes.
Dosages. VCF-style header information (including species-specific
chromosome info, so you don't have to constantly use --chr-set).
Multiallelic variants. Multiple phenotypes. Named categorical
phenotypes.
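Before importing a VCF this way, it may be worth checking that the genotypes in it are actually phased. In the VCF format a phased GT call uses "|" between alleles and an unphased one uses "/". Below is a minimal Python sketch of such a check. The function name count_gt_phase is made up for illustration, and it assumes an uncompressed, tab-delimited VCF passed in as an iterable of lines.

```python
from collections import Counter


def count_gt_phase(vcf_lines):
    """Count phased ('|'), unphased ('/') and other GT calls in VCF lines.

    Hypothetical helper: expects an iterable of text lines from an
    uncompressed, tab-delimited VCF.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        fmt_keys = fields[8].split(":")
        if "GT" not in fmt_keys:
            continue
        gt_index = fmt_keys.index("GT")
        for sample in fields[9:]:
            parts = sample.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            if "|" in gt:
                counts["phased"] += 1
            elif "/" in gt:
                counts["unphased"] += 1
            else:
                counts["other"] += 1  # missing or haploid calls
    return counts


# Example usage (file name taken from the command above):
# with open("chr1.vcf") as handle:
#     print(count_gt_phase(handle))
```

If most calls show up as unphased, the phase was probably lost before the VCF was written, and no import option can bring it back.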
I also couldn't find any way on PLINK's website to allow you to input a phased format.
Have you tried just getting the MAF report (--freq), finding which SNPs fail the threshold you set, and then having PLINK remove those specific SNPs (--exclude snplist.txt)? It isn't quite as elegant but might work.
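The middle step (turning the frequency report into snplist.txt) could look roughly like the Python sketch below. It assumes the PLINK 1.x .frq layout, a whitespace-separated table with a header containing SNP and MAF columns. The function name snps_below_maf is made up, and treating an unparsable MAF (such as NA) as failing is my own choice, not something PLINK prescribes.

```python
def snps_below_maf(frq_lines, threshold=0.05):
    """Return IDs of SNPs whose MAF is below `threshold`.

    Hypothetical helper: expects the lines of a PLINK 1.x .frq report,
    header first. Columns are located by name, not by position.
    """
    lines = iter(frq_lines)
    header = next(lines).split()
    snp_col = header.index("SNP")
    maf_col = header.index("MAF")

    failing = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            maf = float(fields[maf_col])
        except ValueError:
            # Assumption: a non-numeric MAF (e.g. NA) counts as failing.
            failing.append(fields[snp_col])
            continue
        if maf < threshold:
            failing.append(fields[snp_col])
    return failing


# Example usage (plink.frq is the default --freq output name):
# with open("plink.frq") as frq, open("snplist.txt", "w") as out:
#     for snp_id in snps_below_maf(frq, threshold=0.05):
#         out.write(snp_id + "\n")
```

The resulting snplist.txt has one SNP ID per line, which is what --exclude expects.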
Thanks Stephanie. It could be an option, BUT I think plink mixes up the phase in any case. If you run the simple command plink --file input --recode --out output (so nothing is done and there shouldn't be any differences between input and output), the phase is lost anyway. So, OK, there is no way to keep phase using plink. :-S
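If the filtering has to happen outside plink, one workaround is a small script that copies the allele columns through untouched. The Python sketch below is only an illustration under some assumptions: the .ped rows have the usual 6 leading columns followed by two allele columns per SNP, missing alleles are coded 0, SNPs are biallelic, and the first allele column is haplotype 1 and the second is haplotype 2 (a convention of the input data, not something the .ped format defines). The function name filter_phased_ped is made up.

```python
from collections import Counter

MISSING = "0"  # usual missing-allele code in .ped files


def filter_phased_ped(ped_lines, map_lines, min_maf=0.05, keep_iids=None):
    """Filter samples and SNPs without reordering allele columns.

    Returns (ped_rows, map_rows) as lists of field lists. MAF is
    computed on the retained individuals only; SNPs with MAF >= min_maf
    are kept.
    """
    map_rows = [ln.split() for ln in map_lines if ln.strip()]
    ped_rows = [ln.split() for ln in ped_lines if ln.strip()]

    # Sample filter on the individual ID (second .ped column).
    if keep_iids is not None:
        ped_rows = [row for row in ped_rows if row[1] in keep_iids]

    kept = []
    for j in range(len(map_rows)):
        counts = Counter()
        for row in ped_rows:
            for allele in row[6 + 2 * j: 8 + 2 * j]:
                if allele != MISSING:
                    counts[allele] += 1
        total = sum(counts.values())
        if total == 0:
            continue  # no observed alleles, drop the SNP
        minor = total - max(counts.values())
        if minor / total >= min_maf:
            kept.append(j)

    new_map = [map_rows[j] for j in kept]
    new_ped = [
        row[:6] + [a for j in kept for a in row[6 + 2 * j: 8 + 2 * j]]
        for row in ped_rows
    ]
    return new_ped, new_map


# Example usage (file names are placeholders):
# with open("input.ped") as ped, open("input.map") as mp:
#     ped_rows, map_rows = filter_phased_ped(ped, mp, min_maf=0.05)
# with open("output.ped", "w") as out:
#     out.writelines(" ".join(row) + "\n" for row in ped_rows)
# with open("output.map", "w") as out:
#     out.writelines("\t".join(row) + "\n" for row in map_rows)
```

Because each genotype pair is sliced out and written back as-is, a genotype entered as "G A" stays "G A" rather than being rewritten as "A G".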
Basically, I'm new to bioinformatics and to PLINK (obviously). Sorry for asking what may be a silly question... The PED files I've been given for analysis are also in the format you mentioned (since they are phased data). Will having two haploid ones from the same individual interfere with downstream analysis? I don't know if this question makes sense...
I will check it out!