We are a group of about 20 rising sophomores and juniors who are interested in learning new and interesting concepts in genetics and genomics as a group. We would appreciate your help with the following questions about analyzing human genome sequences.
A little context: One student in our group has a Senegalese father and a Japanese mother. They had their genomes shotgun sequenced at 30X coverage and very generously shared their data with us as the following file types: FASTQ, CRAM, CRAI, VCF and TBI.
Our questions are:
Is there a detailed tutorial you would recommend that we can use to predict disease states by comparing the VCF file (given to us) against the ClinVar database? Is it possible to do this with locally installed software and database(s)?
Does having parents with different ethnicities complicate the use and/or interpretation of the ClinVar database?
Is there a detailed tutorial on how to convert a CRAM file back to genome sequence? This would require us to know which reference the reads were aligned to, in order to convert the alignments back to sequences, right?
For human Illumina-based NGS FASTQ sequences, is there a standard pipeline for de novo genome assembly without a reference? If yes, then please share link(s) and tutorials. Thank you.
For any given assembled human genome, is there a standard pipeline for genome annotation? If yes, then please share link(s) and tutorials. Thanks again.
Through some postdocs we know, we have access to some HPCC accounts, so we can run on more than 10 CPUs at a time, with more than 100 GB of memory.
Thanks in advance for your advice, suggestions and sharing relevant links to software and tutorials.
I don't think 30X is good enough coverage to make clinically accurate determinations. Also, anything even remotely accurate needs to be vetted by doctors and clinical genetics counselors, even for simple single-gene disorders, as no genotype is associated with a fixed phenotype to a "set in stone" level. We learn new information every day, and ClinVar doesn't really measure up to a clinically usable database.
The questions you ask above need a team of full-time experts to consult and explain; that is not something you can expect from an online forum of volunteers.
Thank you for your response. Is there a scientific consensus about the minimum acceptable fold coverage for sequencing in order to draw clinically related conclusions? And is there an open source database like ClinVar that folks use and prefer over ClinVar? Thanks again.
You could try HGMD (which is manually curated with information taken from publications), which is IMO a tad better than ClinVar, but I doubt that will make a difference. I'm not sure of the preferred coverage for clinical-level accuracy, but mutation data alone cannot predict very many diseases.
In any case, you may want to restrict yourself to pathogenic entries from ClinVar - ideally, only those that do not have conflicting evidence, where every piece of evidence points to the mutation being pathogenic.
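As a rough illustration of that filtering step, here is a minimal Python sketch that keeps only ClinVar records labelled pathogenic with no conflicting review status and then looks them up in a sample VCF. It assumes the `CLNSIG` and `CLNREVSTAT` INFO fields of ClinVar's VCF release, matches on exact chromosome, position, REF and ALT, and uses invented toy records rather than real variants. It also assumes both files use the same genome build and that alleles are already normalized the same way, and it ignores genotypes, so treat it as a learning exercise and not as an interpretation tool.

```python
# Labels counted as "pathogenic" in this sketch (an assumption; adjust as needed).
PATHOGENIC = {"Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic"}


def parse_info(info):
    """Turn a VCF INFO string such as 'A=1;B=x' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def variant_keys(line):
    """Yield (chrom, pos, ref, alt) for each ALT allele on a VCF data line."""
    chrom, pos, _id, ref, alts = line.split("\t")[:5]
    # Strip a leading 'chr' so 'chr1' and '1' compare equal (naming may differ between files).
    chrom = chrom[3:] if chrom.startswith("chr") else chrom
    for alt in alts.split(","):
        yield (chrom, int(pos), ref, alt)


def load_pathogenic(clinvar_lines):
    """Collect ClinVar records that are pathogenic and not flagged as conflicting."""
    keep = {}
    for line in clinvar_lines:
        if line.startswith("#") or not line.strip():
            continue
        info = parse_info(line.split("\t")[7])
        clnsig = info.get("CLNSIG", "")
        revstat = info.get("CLNREVSTAT", "").lower()
        if clnsig in PATHOGENIC and "conflicting" not in revstat:
            for key in variant_keys(line):
                keep[key] = clnsig
    return keep


def match_sample(sample_lines, pathogenic):
    """Return sample variants that exactly match a retained ClinVar record."""
    hits = []
    for line in sample_lines:
        if line.startswith("#") or not line.strip():
            continue
        for key in variant_keys(line):
            if key in pathogenic:
                hits.append((key, pathogenic[key]))
    return hits


# Toy data with invented coordinates, only to show the expected file layout.
CLINVAR_TOY = """##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t1000\t101\tA\tG\t.\t.\tCLNSIG=Pathogenic;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts
1\t2000\t102\tC\tT\t.\t.\tCLNSIG=Conflicting_interpretations_of_pathogenicity;CLNREVSTAT=criteria_provided,_conflicting_interpretations
2\t3000\t103\tG\tA\t.\t.\tCLNSIG=Benign;CLNREVSTAT=criteria_provided,_single_submitter
"""
SAMPLE_TOY = """#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE
chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1
chr1\t2000\t.\tC\tT\t50\tPASS\t.\tGT\t0/1
chr2\t3000\t.\tG\tA\t50\tPASS\t.\tGT\t1/1
"""

pathogenic = load_pathogenic(CLINVAR_TOY.splitlines())
hits = match_sample(SAMPLE_TOY.splitlines(), pathogenic)
print(hits)  # [(('1', 1000, 'A', 'G'), 'Pathogenic')]
```

For real data, the same functions could be fed lines from `gzip.open(path, "rt")` for the compressed ClinVar and sample VCF files. Anything this turns up is only a starting point for reading about the variant.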
Thank you. We are going to use the recommendations from you and JC to learn new concepts. It may take us at least a few weeks of learning from tutorials, with some small and simple test cases, before we can even start the analysis we envision. At that time, we will post any follow-up questions or doubts. Also, we think it may be better for us to start with higher-coverage data (~100X) rather than get stuck with a genome assembly or VCF file that becomes a hurdle to our learning these analyses. So if you have any suggestions for such a test genome that is openly available for download and use, please share. Thanks again.
You can search SRA for datasets at that level of coverage, but I am not sure you'll find any clinical-grade dataset.