I have a huge vcf file (2.6 GB) including 1500 individuals from three phenotypes (500 individual from each phenotype). It contains 203846 SNPs and 7174 INDELs. The organism is pneumococcus which is a haploid bacterium. I want to perform a genome wide association study in order to find significant variants associated with any phenotype. Any suggested pipeline for this analysis? I would be happy if someone could address some tutorials here that I can study.
I would normalise the VCF file and convert it to PLINK format as per my thread HERE, and then run standard association tests in PLINK. You will have to create the FAM file yourself in order to define your phenotypes. Pay close attention to my comment after the tutorial regarding the --indiv-sort command-line parameter - if you don't control the ordering when you read in your BCF file to PLINK, then PLINK Will distort your sample order without giving any warning, and your custom FAM file may then be out of sync.
I built an entire pipeline for this process but I'll let you do the piecing together. If you get your VCF file to the normalised BCF stage from my tutorial, which is not difficult, then you can do something like:
plink --noweb --bcf pseudomonas.bcf --keep-allele-order --indiv-sort file MySampleOrder.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out pseudomonas ;
#Clean the dataset by filtering out variants genotyped in just 12.5% individuals
#"To include only SNPs with a 95% genotyping rate (5% missing) use --geno 0.05
plink --noweb --bfile pseudomonas --geno 0.05 --make-bed --out pseudomonas.Clean ;
#Perform different statistical comparisons
plink --noweb --assoc --ci 0.95 --counts --bfile pseudomonas.Clean --fam MyCustom.fam --out pseudomonas.Clean ;
#Filter to select only significant variants (column #9), and/or those not found in controls (column #6)
awk '{if ($9<=0.05) print $0}' pseudomonas.Clean.assoc > pseudomonas.Clean.assoc.filt ;
awk '{if ($6==0 && $9<=0.05) print $0}' pseudomonas.Clean.assoc > pseudomonas.Clean.assoc.filt.NoControls ;
In PLINK, you can also do other tests, including logistic regression. Your sample numbers look pretty good.
Kevin- if you sequenced a mixed population of bacteria that was selected for an improved phenotype (each sequencing read representing a separate individual)- is it possible to identify the significant variant associated with the phenotype? Is that what was done here?
Hey, I am not sure, exactly. The person who posted this original question appeared to want to identify statistically significantly different variants / mutations between their 3 Streptococcus pneumoniae ('pneumococcus') groups. Which type of sequencing have you performed? - cDNA or DNA?
Illumina DNA sequencing - providing about 200x coverage. Hoping that since this is a mixed population that was selected for a specific improvement in phenotype- that the causative variant would be more dominant in the population. I have VCF files that have thousands of variants that I would like to narrow down further.
Hey, I am not quite sure. One would indeed expect the causative variant to have high penetrance in the population, but other 'passenger' variants likely also exist that have equal or greater penetrance. I am not sure of the best way to go further other than setting a cutoff for variant allele frequency and then attempting to provide functional interpretation on the variants. Prokaryotic bioinformatics is not exactly my area though - have you considered asking the question on Biology StackExchange, or even opening up a new question here on Biostars for other ideas?
Thank you very much, seems useful.
Kevin- if you sequenced a mixed population of bacteria that was selected for an improved phenotype (each sequencing read representing a separate individual)- is it possible to identify the significant variant associated with the phenotype? Is that what was done here?
Hey, I am not sure, exactly. The person who posted this original question appeared to want to identify statistically significantly different variants / mutations between their 3 Streptococcus pneumoniae ('pneumococcus') groups. Which type of sequencing have you performed? - cDNA or DNA?
Illumina DNA sequencing - providing about 200x coverage. Hoping that since this is a mixed population that was selected for a specific improvement in phenotype- that the causative variant would be more dominant in the population. I have VCF files that have thousands of variants that I would like to narrow down further.
Hey, I am not quite sure. One would indeed expect the causative variant to have high penetrance in the population, but other 'passenger' variants likely also exist that have equal or greater penetrance. I am not sure of the best way to go further other than setting a cutoff for variant allele frequency and then attempting to provide functional interpretation on the variants. Prokaryotic bioinformatics is not exactly my area though - have you considered asking the question on Biology StackExchange, or even opening up a new question here on Biostars for other ideas?
okay I will do that!