GWAS and vcf files. Lacking Phenotype
24 months ago
BQ • 0

Hi, I am very new to this area and am taking a bioinformatics class. For an independent project assignment, I need to do a GWAS. I am working in the bash terminal. I downloaded all the fastq files I need, trimmed them, and converted them to sam/bam, then vcf, then bed/bim/fam. However, when I tried to run the GWAS in plink, I realized I don't have phenotype data. There are supposed to be two phenotypes.

Basically, there are two groups (phenotypes) of fastq files, each containing 29 samples. Let's call them group 1 and group 2. For each group, I converted every fastq to sam and then bam, then merged the 29 bams into one bam. Then I combined the two bams (one per group) into a single vcf.gz. The plink files made from that vcf.gz contain no phenotype data.

I would really appreciate any help, such as which step I got wrong or what I should do to incorporate the phenotype data. Ultimately this is only an assignment, so I don't have to get every detail perfect (like the QC steps), and I am afraid I won't be able to follow overly complicated code. I just want to get to the end and produce a Manhattan plot or something similar. If there is another pipeline that does this, that's also fine.

vcf bam GWAS phenotype • 1.8k views

Please don't put 'urgent' in all caps. Your question is no more important than anyone else's. The error is that you combined the .bams prior to variant calling. I think you should have called variants separately for each sample and then run a GWAS on those variants.


Sorry for the confusion and the wording, and thank you so much for the response! I see your point, so I will try to create vcf files for the two groups separately. What should I do after that? Is there a way to run plink with two vcf files? Or how should I combine the two vcf files while incorporating the phenotypes?

24 months ago
4galaxy77 2.9k

It's easiest to start from the GWAS analysis and work backwards. Maybe your teacher has a different idea of how you should do this, but this is how I (and most people) would approach it.

For a GWAS in plink, you need two things: a single VCF file containing the genotypes of all the samples (in fact you should convert it to .pgen once you have the VCF, but this step is easy), and a phenotype file that gives the phenotype of each sample in the VCF. The phenotype is never stored in the VCF itself; it always comes from this separate file.
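As a rough illustration of what the phenotype file could look like, here is a minimal Python sketch that writes one. The sample IDs (G1S01 ... G2S29) and the file name pheno.txt are made up for the example; the IDs must match the sample names in your VCF header exactly. It assumes plink's usual case/control coding (1 = control, 2 = case) and a plink2-style header line starting with #IID; which group counts as "case" is also an assumption here.

```python
from pathlib import Path


def build_pheno_rows(controls, cases):
    """Pair each sample ID with a plink case/control code (1 = control, 2 = case)."""
    return [(s, 1) for s in controls] + [(s, 2) for s in cases]


def format_pheno(rows):
    """Render the rows as a tab-separated table with a '#IID PHENO' header line."""
    lines = ["#IID\tPHENO"] + [f"{sample}\t{code}" for sample, code in rows]
    return "\n".join(lines) + "\n"


# Hypothetical sample IDs: replace these with the names in your VCF header.
group1 = [f"G1S{i:02d}" for i in range(1, 30)]  # 29 samples, treated as controls
group2 = [f"G2S{i:02d}" for i in range(1, 30)]  # 29 samples, treated as cases

rows = build_pheno_rows(group1, group2)
pheno_text = format_pheno(rows)

# Write the table to disk so plink can read it.
Path("pheno.txt").write_text(pheno_text)
print(f"{len(rows)} samples written to pheno.txt")
```

With a file like this, something along the lines of `plink2 --pfile merged --pheno pheno.txt --glm allow-no-covars --out gwas` should run a basic association test, where `merged` is whatever prefix you gave your .pgen files.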

To obtain this 'multisample' VCF, you need to call variants, i.e. go from fastq -> bam -> vcf, for each individual separately, and then merge the single-sample VCFs into one multisample VCF.
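The merge and conversion steps could be sketched roughly like this. It is only a dry run: the Python below builds and prints the commands rather than executing them, and the file names (G1S01.vcf.gz and so on, merged.vcf.gz) are hypothetical. It assumes each per-sample VCF is already bgzip-compressed, and that bcftools and plink2 are the tools you are using.

```python
import shlex


def merge_commands(sample_vcfs, merged="merged.vcf.gz", prefix="merged"):
    """Build the commands to index, merge and convert per-sample VCFs."""
    # bcftools merge needs an index for every input file.
    cmds = [["bcftools", "index", "-t", vcf] for vcf in sample_vcfs]
    # Merge all single-sample VCFs into one compressed multisample VCF.
    cmds.append(["bcftools", "merge", "-Oz", "-o", merged, *sample_vcfs])
    # Convert the merged VCF to plink2's .pgen/.pvar/.psam format.
    cmds.append(["plink2", "--vcf", merged, "--make-pgen", "--out", prefix])
    return cmds


# Hypothetical file names: one bgzipped VCF per sample, 29 per group.
sample_vcfs = [f"G{g}S{i:02d}.vcf.gz" for g in (1, 2) for i in range(1, 30)]

commands = merge_commands(sample_vcfs)
for cmd in commands:
    print(shlex.join(cmd))
```

One caveat: when a site is absent from a sample's VCF, bcftools merge records that sample's genotype as missing rather than homozygous reference. If you call variants with GATK, producing per-sample GVCFs and genotyping them jointly is the usual way around this.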


I am also new to GWAS and I have a question, if you don't mind. I did Fastqc>sam>bam>sorted bam>vcf and finally merged all the vcf files. I checked both the individual vcf files and the merged one, and phenotype data is missing from all of them. I used the GATK pipeline to generate the vcf files.
