I have a data set for 11,000 individuals and would like to filter out related individuals and also based on a phenotype.
I have a vcf file format and prefer to work with Plink, Hail or vcftools?
I have a data set for 11,000 individuals and would like to filter out related individuals and also based on a phenotype.
I have a vcf file format and prefer to work with Plink, Hail or vcftools?
You have 11,000 individuals in a VCF file? - that's impressive. VCFTools does have a relatedness function that it calculates pairwise for all samples. It generally seems to work fine, as I tried it on data that contained monozygotic twins once; however, in the context of false-negatives and false-positives that plague NGS, not even biologicaly monozygotic twins look the same after their DNA has passed through a 'next gen' sequencer and analysis pipeline.
I would favour getting the data into PLINK Format, which provides analyses across all sorts of things including relatedness. To do that, first gzip and tab-index your VCF and then normalise it:
#1st pipe, splits multi-allelic calls into separate variant calls
#2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
/Programs/bcftools-1.3.1/bcftools norm -m-any My.gz.vcf | /Programs/bcftools-1.3.1/bcftools norm -Ob --check-ref w -f human_g1k_v37.fasta > My.Norm.bcf ;
This will keep the VCF ID field as it is (assuming it's rs ID?). If you want to set it to something unique, use:
/Programs/bcftools-1.3.1/bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%POS:%REF:%ALT'
To read this into PLINK:
/Programs/plink1.90/plink --noweb --bcf My.Norm.bcf --keep-allele-order --indiv-sort file IDsort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out MyPlinkData ;
It's critical when reading VCF/BCF into PLINK that you fix the order of the samples, because in virtually all cases PLINK will not order the samples as they appear in the VCF/BCF. You do this by specifying --indiv-sort file IDsort.list
. IDsort.list contains a listing of FID and IID. It's easier to keep FID as 0 here:
0 16422
0 16427
0 16451
0 17412
0 17422
0 17413
If you have families in your data and want to do family-specific analysis, you'll have to reheader your VCF/BCF to have sample IDs as FID_IID, change --const-fid
above to --id-delim _
, and also modify the IDsort file:
Family1 16422
Family1 16427
Family2 16451
Family2 17412
Family3 17422
Family3 17413
Based on your ID sort file, you should also create a custom FAM file in which the samples are ordered the same. When you read data from VCF/BCF into PLINK, it won't know he gender or case control status.
In all analyses, you then just include your custom FAM as follows:
#Perform family TDT comparisons
/Programs/plink1.90/plink --noweb --bfile MyPlinkData --tdt --model mperm=500 --ci 0.95 --fam Custom.fam ;
Other people will hopefully have other suggestions.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Kevin thank you very much, however; what I want to do is to: 1.Remove related individuals 2.Remove individuals with a specific phenotype for which I have a txt file. If I understand correctly your code changes vcf to plink files only right? How can I filter out individuals as in 1,2.? Also, I don't have families I would be grateful four your advice.
Hello, it would have been better to have replied as a comment to my answer / Mejor hubiera sido escrito su comentario a mi respuesta encima.
You mentioned Plink in your question, so I presumed that you already knew how to use Plink. If you get your dataset in Plink format (with my code), you can then identify related individuals by following the procedures here: http://zzz.bwh.harvard.edu/plink/strat.shtml
For removing (or keeping) certain individuals, take a look here: https://www.cog-genomics.org/plink/1.9/filter
If you want to avoid Plink, just use BCFtools to remove individuals (if you already know who you want to keep/remove):
SamplesToKeep.list just contains a single column of IDs that you want to keep