Hello All,
Let's say that some whole genome sample was sequenced with a coverage of 30x. As far as I'm aware, this means that, with respect to the reference genome's nucleotides, the data represents each nucleotide 30 times on average.
Let's also say that the tissue sample was heterozygous at some loci, where the frequencies of the two alleles are both 0.5. Does this mean that the coverage for each allele at these locations is, in effect, 15x? I.e. if you aligned the data (and it aligned correctly), you would expect to see ~15 reads with allele 1 and ~15 with allele 2.
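To make the arithmetic concrete, here is a minimal sketch of that intuition (my own illustration, assuming a uniform 30x depth; real depth varies around the mean): each read at a heterozygous site comes from one haplotype with probability 0.5, so the per-allele count is Binomial(30, 0.5), i.e. ~15 on average but noisy from site to site.

```python
# Minimal illustration (assumed uniform 30x depth): at a heterozygous site,
# each read covers haplotype 1 or haplotype 2 with probability 0.5, so the
# number of reads supporting allele 1 follows Binomial(depth, 0.5).
import random

DEPTH = 30        # reads covering the site
N_SITES = 10_000  # heterozygous sites to simulate

allele1_counts = [
    sum(1 for _ in range(DEPTH) if random.random() < 0.5)
    for _ in range(N_SITES)
]

print("mean reads supporting allele 1:", sum(allele1_counts) / N_SITES)  # ~15
print("range across sites:", min(allele1_counts), "-", max(allele1_counts))
```

Individual sites will scatter quite a bit around the 15/15 split (splits like 10/20 are not unusual), even though the average is 15.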
N.B.
I ask because I am trying to make a simulated cancer genomics dataset. For this I am using ART, and have "mutated" the hg19.fa file by introducing some point mutations. This mutated file will represent one haploid set, whilst the non-mutated hg19.fa file will represent the other haploid set; this should add realistic point mutations, which are usually heterozygous in nature.
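For what it's worth, a rough sketch of the mutation step I have in mind is below (the 0.001 rate and file names are placeholders, and it only introduces SNVs, leaving headers and non-ACGT characters untouched):

```python
# Rough sketch of the mutation step (placeholder rate and file names):
# stream hg19.fa line by line and substitute bases at a fixed per-base rate,
# leaving FASTA headers and non-ACGT characters (e.g. N runs) unchanged.
import random

MUTATION_RATE = 0.001          # placeholder per-base SNV rate
BASES = "ACGT"

def substitute(base):
    """Return one of the three nucleotides other than `base`."""
    return random.choice([b for b in BASES if b != base.upper()])

with open("hg19.fa") as ref, open("hg19_mutated.fa", "w") as out:
    for line in ref:
        if line.startswith(">"):
            out.write(line)    # headers pass through unchanged
            continue
        seq = [
            substitute(b) if b.upper() in BASES and random.random() < MUTATION_RATE else b
            for b in line.rstrip("\n")
        ]
        out.write("".join(seq) + "\n")
```

It would probably also be worth recording each introduced mutation (chromosome, position, ref, alt) to a truth file, so that calls made on the simulated reads can be checked later.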
I then plan to simulate sequencing at 30x, so I was going to run ART on each file at 15x and then combine the resulting reads to get 30x, roughly as sketched below. Any thoughts?
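Something like this is what I had in mind for the read simulation (the art_illumina flags, profile and read length here are illustrative, and the output file names depend on the prefix, so check `art_illumina --help` for the installed version):

```python
# Sketch of the 15x + 15x plan (flag values and profile are illustrative):
# simulate 15x single-end reads from each haploid FASTA with art_illumina,
# then concatenate the two FASTQs to get ~30x combined coverage.
import shutil
import subprocess

HAPLOTYPES = {
    "hap1": "hg19.fa",            # unmutated copy
    "hap2": "hg19_mutated.fa",    # copy carrying the simulated point mutations
}

for name, fasta in HAPLOTYPES.items():
    subprocess.run(
        ["art_illumina",
         "-ss", "HS25",   # error/quality profile (HiSeq 2500)
         "-i", fasta,     # input reference
         "-l", "100",     # read length
         "-f", "15",      # fold coverage per haploid copy
         "-o", name],     # output prefix, e.g. hap1.fq for single-end reads
        check=True,
    )

# Concatenate the two 15x read sets into a single ~30x FASTQ.
with open("combined_30x.fq", "w") as combined:
    for name in HAPLOTYPES:
        with open(f"{name}.fq") as reads:
            shutil.copyfileobj(reads, combined)
```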
Thanks, Izaak
One additional comment on your plan. Somatic variants will occur on both copies of the genome, so your plan to mutate one copy of the reference and not the other, then generate reads, is flawed (your mutations will always be seen in phase, wherever phasing is resolvable).
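A quick sketch of the point (the site list and names here are made up): assign each somatic SNV to one haplotype at random before generating reads, rather than putting them all on the same copy.

```python
# Sketch: assign each somatic SNV to haplotype 1 or haplotype 2 at random,
# so the simulated variants are not all in phase on a single copy.
import random

somatic_snvs = [
    ("chr1", 1_234_567, "A", "T"),   # placeholder (chrom, pos, ref, alt) records
    ("chr2", 7_654_321, "G", "C"),
]

per_haplotype = {1: [], 2: []}
for snv in somatic_snvs:
    per_haplotype[random.choice([1, 2])].append(snv)

# Apply per_haplotype[1] to one copy of the reference and per_haplotype[2] to
# the other before simulating reads from each, instead of mutating only one.
```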
Hmm, thanks Chris. Yes, I figured it wasn't particularly realistic to allow all mutations to occur on a single copy. However, apart from the fact that putting every mutation on one copy represents the highly unlikely case in which all somatic mutations occur on one haploid set within a real genome, I did not think such a bias would incorrectly model the likely case, where mutations occur across both copies.
My aim is to generate the simplest data that correctly captures what I consider to be the basic case: mutations, with no genetic variation (amongst the healthy data, cancer data and reference) and no sequencing errors.
With such data, I can simplify the problem during the development of early versions of the algorithm. However, in order to do so, the data must not deviate incorrectly from the true cases; rather, it must be a subset of the true cases' complexity, otherwise I will likely implement some incorrect control logic in the algorithm.
I did not know about phasing; would you mind explaining what it is? Does phasing occur due to a bias towards mutations being identified on one haploid set of the genome, such as in the case I have crudely modelled? Exactly which sub-routine of a variant calling algorithm does phasing affect? Alignment, or post-alignment analysis?
Could you point me at any reviews about trying to simulate NGS data? Or something related to phasing and other problems commonly encountered when sequencing?
Thanks for your help!
From what I can tell, the GATK documentation (http://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing) suggests that phasing, if the page's author and you are talking about the same phasing, occurs after variant calling, using the VCF and a SAM file.
If, as they suggest, phasing (or resolving phasing / finding the most likely haplotype; I can't quite grasp the terminology) occurs after the execution of a variant calling algorithm (VCA), then I can only see a single case where a VCA needs to consider phasing in its control logic: during alignment, as this is when the SAM file is generated.
I am using bwa as the alignment sub-routine in our algorithm, and therefore I imagine that the rest of the VCA does not need to consider phasing. Or am I completely wrong? Please let me know! :)
To be more realistic, you could do 3 mutation steps:
Now you have 30x coverage with a 1/1000 mutation rate, a quarter of which is homozygous. I'm not sure about the exact ratio in humans but I think it's something like that.
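One way to pick rates for three rounds of mutation (the specific numbers below are my own back-of-the-envelope choice, not something stated above) is to treat mutations introduced before the two haplotypes split as homozygous and mutations added to each haplotype afterwards as heterozygous:

```python
# Back-of-the-envelope check (rates are illustrative): mutations introduced
# before the two haplotypes split are shared (homozygous), mutations added to
# each haplotype independently afterwards are heterozygous.
HOMOZYGOUS_RATE = 1 / 4000    # step 1: mutate hg19 once to make a shared "ancestor"
HET_RATE_PER_HAP = 3 / 8000   # steps 2 and 3: mutate the ancestor twice, independently

total_rate = HOMOZYGOUS_RATE + 2 * HET_RATE_PER_HAP
hom_fraction = HOMOZYGOUS_RATE / total_rate

print(total_rate)      # 0.001 -> about one variant per 1000 bases
print(hom_fraction)    # 0.25  -> a quarter of variant sites homozygous
# Then simulate 15x reads from each haplotype (as above) and combine for 30x.
```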
Thanks Brian. Yep, sounds more realistic.