Hi All,
I am very new to NGS data analysis and am struggling with my project. I need help sorting out my dataset and the workflow for my analysis. Below is what I plan to do; I'm unsure whether my understanding is right. Any feedback or suggestions would help.
I'm trying to figure out the parent-of-origin effect on global gene expression. For this, I need a trio dataset (child/mother/father). To begin the analysis, I have downloaded separate VCF files for the child (NA19256), mother (NA19257), and father (NA19256) from the 1000 Genomes Project. This dataset is unphased.
- As a next step, I intend to phase the files with Beagle and then use vcftools to convert the VCF files to PLINK format to obtain the ped and map files.
- Then I would run PREMIM and EMIM on the ped and map files to obtain the parent-of-origin information.
My intention is to map the child genome to mother's and dad's respectively to identify the contribution of each parent.
Questions:
- Is this the right approach?
- Can anyone please suggest other trio datasets that can be used for this analysis?
- If my understanding is incorrect, are there other approaches I can look into for this analysis that would get me results faster?
Eagerly in need of help. Feedback and suggestions are highly appreciated.
Thank You.
I sense something weird when the phrases "gene expression" and "VCF" occur together - how does one infer anything about expression from a VCF? Would one not need an expression dataset (like RNAseq) for such analysis? If you're trying to phase and pick inherited/transmitted and de novo variants, that is something I know is possible with VCF data.
Of course, "parent of origin" could be something specific, invalidating my entire premise. If that is the case, please do help us understand how the "parent of origin" analysis works.
Well, I forgot to mention that I have the FASTQ RNA-seq data of the child, which I should compare to the heterozygous SNPs extracted from the mom and dad VCF files. Considering this extra information, can you please confirm if my approach would work?
How are you planning on comparing RNAseq to parental variants? I'm assuming you'll be extracting expression from the RNAseq. It is possible to phase and find out inherited/denovo variants in the child, but comparing mutations in the parents to expression in the child is a new concept to me.
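To make the "inherited variants" part concrete, here is a minimal Python sketch of how the parental origin of each allele at a heterozygous child site could be worked out from trio genotypes alone. The function name `parent_of_origin` and the tuple-of-alleles genotype representation are assumptions made for illustration; they are not taken from Beagle, PREMIM/EMIM, or any other tool mentioned in this thread, and real genotypes would come from a VCF parser.

```python
from typing import Dict, Optional, Tuple

Genotype = Tuple[str, str]  # hypothetical representation: the two alleles at one site


def parent_of_origin(child: Genotype, mother: Genotype,
                     father: Genotype) -> Optional[Dict[str, str]]:
    """Return which child allele came from which parent, or None if unknown.

    None is returned when the child is homozygous (uninformative), when both
    assignments are possible (e.g. all three are heterozygous), or when no
    assignment is Mendelian-consistent (genotyping error or de novo variant).
    """
    a, b = child
    if a == b:
        return None
    # Try both ways of splitting the child's alleles between the parents and
    # keep only the ones each parent could actually have transmitted.
    options = [(m, p) for m, p in ((a, b), (b, a))
               if m in mother and p in father]
    if len(options) != 1:
        return None
    maternal, paternal = options[0]
    return {"maternal": maternal, "paternal": paternal}


informative = parent_of_origin(("A", "G"), ("A", "A"), ("A", "G"))
ambiguous = parent_of_origin(("A", "G"), ("A", "G"), ("A", "G"))
inconsistent = parent_of_origin(("A", "T"), ("A", "A"), ("A", "A"))
```

Only sites like the first example, where exactly one assignment is possible, tell you which RNAseq allele is maternal and which is paternal; the other two cases would have to be skipped or resolved some other way.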
One thing you might want to deal with is reads that map equally well to both the maternal and paternal genomes. You probably want to exclude these reads. One way to do that would be to create a diploid offspring genome using your phasing information, and then map to that, disallowing multi-mappers.
An alternative would be to map to your two haploid genomes, sort the resulting BAM files by read name, and then iterate over both BAMs simultaneously. For each read, keep only the alignment with the higher MAPQ, and discard the read if the MAPQ is the same in both.
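A rough sketch of that second approach in Python, just to show the selection logic. It assumes each alignment has already been reduced to a `(read name, MAPQ)` pair, that both streams are name-sorted, and that both contain exactly one record per read in the same order; the function name `assign_reads` and the record layout are hypothetical. In practice the records would come from a BAM reader, and secondary or supplementary alignments would need to be filtered out first.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, int]  # hypothetical minimal record: (read name, MAPQ)


def assign_reads(maternal: Iterable[Read],
                 paternal: Iterable[Read]) -> Iterator[Tuple[str, str]]:
    """Walk two name-sorted alignment streams in lockstep.

    Yields (read name, haplotype) for reads whose MAPQ is strictly higher
    on one haplotype; reads with equal MAPQ are ambiguous and are dropped.
    """
    for (m_name, m_mapq), (p_name, p_mapq) in zip(maternal, paternal):
        if m_name != p_name:
            # The lockstep walk only works if both files list the same reads
            # in the same order.
            raise ValueError(f"streams out of sync: {m_name} vs {p_name}")
        if m_mapq > p_mapq:
            yield m_name, "maternal"
        elif p_mapq > m_mapq:
            yield m_name, "paternal"
        # Equal MAPQ: the read does not discriminate, so discard it.


maternal_hits = [("r1", 60), ("r2", 30), ("r3", 20)]
paternal_hits = [("r1", 60), ("r2", 42), ("r3", 3)]
assignments = list(assign_reads(maternal_hits, paternal_hits))
```

Here `r1` is dropped because it maps equally well to both haplotypes, `r2` is assigned to the paternal genome, and `r3` to the maternal one.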