Question

GWAS and vcf files. Lacking Phenotype

2.0 years ago

BQ • 0

Hi, I am very new to this area, and I am taking a class about bioinformatics. For an independent project assignment, I need to do a GWAS. I am using the bash terminal. I downloaded all the fastq files I need, trimmed them, and converted them into sam/bam, then vcf, then bed/bim/fam etc. However, when I tried to perform the GWAS in plink, I realized I don't have phenotype data. It is supposed to have two phenotypes.

Basically there are two groups/phenotypes of fastq files, each containing 29 samples. Let's say they are group 1 and group 2. For each group, I converted every fastq to sam and then bam, then I merged the 29 bam files into one bam. Then I combined the two bam files (one per group) into a single vcf.gz. As a result, there is no phenotype data in the plink files made from it.

I would really appreciate any help, such as which step I might have gotten wrong, or what I should do to incorporate the phenotype data. Ultimately this is only an assignment, so I don't have to be perfect in every detail (like the QC steps), and I am afraid I cannot follow very complicated code. I just want to get to the end and produce a Manhattan plot or something similar. If there is another pipeline that does this, that's also fine.

vcf bam GWAS phenotype • 1.8k views

ADD COMMENT • link updated 8 weeks ago by K • 0 • written 2.0 years ago by BQ • 0

Cross-post https://bioinformatics.stackexchange.com/questions/20071/urgent-help-needed-with-gwas-and-vcf-files-lacking-phenotype

ADD REPLY • link 2.0 years ago by M__ ▴ 200

Please don't put 'urgent' in all caps. Your question is no more important than anyone else's. The error is that you combined the .bams prior to variant calling. I think you should have called variants separately for each sample and then run a GWAS on those variants.

ADD REPLY • link 24 months ago by 4galaxy77 2.9k

Sorry for the confusion and wording, and thank you so much for the response! I see your point, so I will try to create vcf files for the two groups separately. What should I do after that? Is there a way to run plink with two vcf files? Or how should I combine the two vcf files while incorporating the phenotypes?

ADD REPLY • link 24 months ago by BQ • 0

score 0 · Answer 1 · 2022-11-26

24 months ago

4galaxy77 2.9k

It's easiest to start from the GWAS analysis and work backwards. Maybe your teacher has a different idea of how you should do this, but this is how I (and most people) would approach this.

For a GWAS in plink, you need a single VCF file (in fact you should convert to .pgen after you get your VCF, but this step is easy) which contains the genotypes of all the samples, and a phenotype file which tells you the phenotype of all the samples in the vcf.
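As a rough illustration of the phenotype file described above, here is a minimal Python sketch. It assumes the PLINK 2 phenotype layout (a header line starting with `#IID`, case/control status coded as 1 and 2). The sample IDs, the phenotype name `PHENO1`, and the file name `pheno.txt` are made up for the example; the real IDs have to match the sample names in the VCF header exactly.

```python
from pathlib import Path


def build_phenotype_lines(group1_ids, group2_ids, pheno_name="PHENO1"):
    """Return the lines of a PLINK 2 style phenotype file.

    Assumption: group 1 is coded 1 (control) and group 2 is coded 2 (case).
    """
    overlap = set(group1_ids) & set(group2_ids)
    if overlap:
        raise ValueError(f"samples listed in both groups: {sorted(overlap)}")
    lines = [f"#IID\t{pheno_name}"]
    lines += [f"{sample_id}\t1" for sample_id in group1_ids]
    lines += [f"{sample_id}\t2" for sample_id in group2_ids]
    return lines


# Hypothetical sample IDs: 29 samples per group, as in the question.
group1 = [f"g1s{i:02d}" for i in range(1, 30)]
group2 = [f"g2s{i:02d}" for i in range(1, 30)]

pheno_lines = build_phenotype_lines(group1, group2)
Path("pheno.txt").write_text("\n".join(pheno_lines) + "\n")
print(f"wrote {len(pheno_lines) - 1} samples to pheno.txt")
```

The file would then be given to plink alongside the genotypes, with something along the lines of `plink2 --vcf merged.vcf.gz --pheno pheno.txt --glm allow-no-covars --out gwas` (file names here are placeholders). The phenotype lives in this separate file, never inside the VCF itself.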

To obtain this 'multisample' VCF, you need to call variants, i.e. go from fastq -> bam -> vcf for each individual separately, and then merge each single VCF into a multisample vcf.
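The per-sample workflow above can be sketched as follows. This Python snippet only builds the shell commands as strings and prints them; it does not run anything. It assumes paired-end reads, a reference `ref.fa` that is already indexed, and bwa, samtools and bcftools as the tools. None of that is specified in this thread, and the file naming scheme is hypothetical.

```python
import shlex


def per_sample_commands(sample, ref="ref.fa"):
    """Commands for one sample: fastq -> sorted bam -> single-sample vcf."""
    s = shlex.quote(sample)
    r = shlex.quote(ref)
    # The read group sets the sample name (SM) that ends up in the VCF header.
    read_group = shlex.quote(f"@RG\\tID:{sample}\\tSM:{sample}")
    return [
        f"bwa mem -R {read_group} {r} {s}_R1.fastq.gz {s}_R2.fastq.gz"
        f" | samtools sort -o {s}.bam -",
        f"samtools index {s}.bam",
        f"bcftools mpileup -f {r} {s}.bam | bcftools call -mv -Oz -o {s}.vcf.gz",
        f"bcftools index {s}.vcf.gz",
    ]


def merge_command(samples, out="merged.vcf.gz"):
    """Command that merges the single-sample VCFs into one multisample VCF."""
    inputs = " ".join(f"{shlex.quote(s)}.vcf.gz" for s in samples)
    return f"bcftools merge -Oz -o {shlex.quote(out)} {inputs}"


# Hypothetical sample names; in practice there would be 58 of them.
samples = ["s1", "s2"]
commands = [cmd for s in samples for cmd in per_sample_commands(s)]
commands.append(merge_command(samples))
print("\n".join(commands))
```

One thing to watch with this route: when single-sample VCFs are merged, a site that was not called in a given sample shows up as a missing genotype for that sample rather than as homozygous reference. `bcftools merge` has a `--missing-to-ref` option for this, though whether that assumption is acceptable depends on the data.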

ADD COMMENT • link 24 months ago by 4galaxy77 2.9k

I am also new to GWAS and I have a question, if you don't mind. I did Fastqc > sam > bam > sorted bam > vcf and finally I merged all the vcf files. I checked both the individual vcf files and the merged one, and phenotype data is missing in all of them. I used the GATK pipeline to generate the vcf files.

ADD REPLY • link 8 weeks ago by K • 0