What is the coverage when allele frequency is < 1?
8.1 years ago

Hello All,

Let's say that a whole-genome sample was sequenced at 30x coverage. As far as I'm aware, this means that each nucleotide of the reference genome is represented in the data 30 times on average.

Let's also say that the tissue sample was heterozygous at some loci, where the frequency of each of the two alleles is 0.5. Does this mean that the coverage of each allele at these locations is, in effect, 15x? I.e. if you aligned the data (and it aligned correctly), you would expect to see ~15 reads with allele 1 and ~15 with allele 2.

N.B

I ask because I am trying to make a simulated cancer genomics dataset. For this I am using ART, and have "mutated" the hg19.fa file by introducing some point mutations. This mutated file will represent one haploid set, whilst the non-mutated hg19.fa file will represent the other haploid set; this should add realistic point mutations, which are usually heterozygous in nature.

I then plan to sequence at 30x, so, I was going to run ART for each file at 15x and then combine to get 30x. Any thoughts?
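As a rough sanity check of the 15x + 15x idea, here is a minimal toy sketch. It does not use ART; it assumes reads start uniformly at random on a small made-up genome, and every name and number in it (`depth_at_site`, `mean_allele_depths`, the 10 kb genome, the 100 bp reads) is hypothetical. It only illustrates that pooling a 15x run from each haploid copy gives about 15 reads per allele at a heterozygous site, with run-to-run scatter.

```python
import random


def depth_at_site(genome_len, read_len, fold, site, rng):
    """Count reads covering `site` when read starts are uniform at random."""
    n_reads = round(fold * genome_len / read_len)
    depth = 0
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        if start <= site < start + read_len:
            depth += 1
    return depth


def mean_allele_depths(trials=200, seed=0):
    """Average per-allele depth at one heterozygous site over many toy runs."""
    rng = random.Random(seed)
    genome_len, read_len, site = 10_000, 100, 5_000  # toy values, not hg19
    ref_total = alt_total = 0
    for _ in range(trials):
        # One 15x run per haploid copy; pooling both gives ~30x in total.
        ref_total += depth_at_site(genome_len, read_len, 15, site, rng)
        alt_total += depth_at_site(genome_len, read_len, 15, site, rng)
    return ref_total / trials, alt_total / trials


if __name__ == "__main__":
    ref_mean, alt_mean = mean_allele_depths()
    print(f"mean ref depth: {ref_mean:.1f}, mean alt depth: {alt_mean:.1f}")
```

Individual runs will scatter around 15 per allele, so a single simulated site will rarely show an exact 15/15 split.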

Thanks, Izaak

next-gen ngs coverage sequencing • 3.3k views

One additional comment on your plan. Somatic variants will occur on both copies of the genome, so your plan to mutate one FASTA file and not the other, then generate reads, is flawed. (Your mutations will always be seen in phase, where phasing is resolvable.)
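One way to avoid putting every mutation on the same copy is to pick a copy at random for each mutation. The sketch below is only an illustration on plain Python strings: the function name `apply_point_mutations` and the position-to-base dictionary are assumptions, not part of ART or any other tool, and real chromosomes would need proper FASTA reading and writing.

```python
import random


def apply_point_mutations(reference, mutations, seed=0):
    """Return two haplotype sequences with each mutation placed on one copy.

    `mutations` maps a 0-based position to the replacement base.
    The returned `placement` dict records which copy (0 or 1) got each one.
    """
    rng = random.Random(seed)
    haplotypes = [list(reference), list(reference)]
    placement = {}
    for pos, alt in sorted(mutations.items()):
        copy_index = rng.randrange(2)  # choose copy 0 or copy 1 at random
        haplotypes[copy_index][pos] = alt
        placement[pos] = copy_index
    return "".join(haplotypes[0]), "".join(haplotypes[1]), placement


if __name__ == "__main__":
    hap0, hap1, placement = apply_point_mutations(
        "ACGTACGTAC", {1: "T", 5: "A", 8: "G"}
    )
    print(hap0, hap1, placement)
```

Mutations that land on the same copy are in phase with each other; those on different copies are not, which is the situation a single mutated file can never produce.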


Hmm, thanks Chris. Yes, I figured it wasn't particularly realistic to put all the mutations in a single FASTA file. That models the highly unlikely case in which every somatic mutation falls on one haploid set of a real genome. Apart from that, though, I did not think such a bias would incorrectly model the likely case, where mutations occur on both FASTA files.

My aim is to generate the simplest data that correctly captures what I am considering to be the basic case: Mutations, with no genetic variation (amongst the healthy data, cancer data and reference) and no errors.

With such data, I can simplify the problem during the development of early versions of the algorithm. However, in order to do so, the data must not incorrectly deviate from the true cases; rather, it must be a subset of the true cases' complexity. Otherwise I will likely implement some incorrect control logic in the algorithm.

I did not know about phasing; would you mind explaining what it is? Does phasing occur due to a bias for mutations being identified on one haploid set of the genome, such as in the case I have crudely modelled? Exactly which sub-routine of a variant calling algorithm does phasing affect? Alignment, or post-alignment analysis?

Could you point me at any reviews about trying to simulate NGS data? Or something related to phasing and other problems commonly encountered when sequencing?

Thanks for your help!


From what I can tell, this page (http://gatkforums.broadinstitute.org/gatk/discussion/45/purpose-and-operation-of-read-backed-phasing) suggests that phasing, if the site author and you are talking about the same phasing, occurs after variant calling, using the VCF and a SAM file.

If, as they suggest, phasing (or resolving phasing / finding the most likely haplotype; I can't grasp the terminology) occurs after the execution of a variant calling algorithm (VCA), I can only see a single case where a VCA needs to consider phasing in its control logic: during alignment, as this is when the SAM file is generated.

I am using bwa as the alignment sub-routine in our algorithm, and therefore I imagine that the rest of the VCA does not need to consider phasing. Or am I completely wrong? Please let me know! :)


To be more realistic, you could do 3 mutation steps and then generate the reads:

1) Mutate the genome at 0.025% to make homozygous variations, producing base.fa.
2) Mutate base.fa at 0.075% to produce copy1.fa.
3) Mutate base.fa at 0.075% again to produce copy2.fa.
4) Generate 15x reads from copy1.fa and 15x from copy2.fa.

Now you have 30x coverage with a 1/1000 mutation rate, a quarter of which is homozygous. I'm not sure about the exact ratio in humans but I think it's something like that.
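The mutation steps above can be sketched in a few lines of Python. This is a minimal illustration on in-memory strings, not a replacement for a real mutation tool: the names `mutate`, `make_diploid` and `count_differences` are made up here, only substitutions are modelled, and the rates are the percentages from the list written as fractions (0.025% = 0.00025, 0.075% = 0.00075).

```python
import random

BASES = "ACGT"


def mutate(sequence, rate, rng):
    """Substitute each A/C/G/T with a different base with probability `rate`."""
    out = []
    for base in sequence:
        if base in BASES and rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_diploid(reference, hom_rate=0.00025, het_rate=0.00075, seed=0):
    """Build base, copy1 and copy2 following the three mutation steps."""
    rng = random.Random(seed)
    base = mutate(reference, hom_rate, rng)  # step 1: shared (homozygous)
    copy1 = mutate(base, het_rate, rng)      # step 2: first haploid copy
    copy2 = mutate(base, het_rate, rng)      # step 3: second haploid copy
    return base, copy1, copy2


def count_differences(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    reference = "ACGT" * 50_000  # toy 200 kb sequence, not a real genome
    base, copy1, copy2 = make_diploid(reference)
    print("homozygous sites:", count_differences(reference, base))
    print("extra sites on copy1:", count_differences(base, copy1))
    print("extra sites on copy2:", count_differences(base, copy2))
```

Each copy would then be written out as its own FASTA file and passed to the read simulator at 15x.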


Thanks Brian. Yep, sounds more realistic.

8.1 years ago

Theoretically, the coverage per allele would indeed be total coverage/2 (for diploid genomes). In practice, however, the split is often not exactly 50%.


Sure, luckily, the algorithm I am writing is still in the theoretical stages ;)

Also, ART is realistic enough to add coverage variance. Thanks for the help!

8.1 years ago

If you're simulating cancer genome data, you should not neglect the roles that purity, ploidy, and subclonal populations play in determining allele frequency. They can alter the frequency away from 50% quite substantially.
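To give a feel for the size of the effect, here is a sketch of a commonly used simplified model for the expected variant allele frequency. It is an assumption for illustration, not something taken from this thread or from a specific tool: the tumour fraction of the sample is `purity`, the mutation sits on `mutated_copies` copies in a fraction `cancer_cell_fraction` of the tumour cells, and normal cells carry 2 copies of the locus. The function name and parameter names are hypothetical.

```python
def expected_vaf(purity, tumor_copy_number, mutated_copies=1,
                 cancer_cell_fraction=1.0, normal_copy_number=2):
    """Expected variant allele frequency under a simple mixture model.

    Assumed model (illustrative only):
        VAF = purity * ccf * mutated_copies
              / (purity * tumor_CN + (1 - purity) * normal_CN)
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if not 0.0 <= cancer_cell_fraction <= 1.0:
        raise ValueError("cancer_cell_fraction must be between 0 and 1")
    mutant = purity * cancer_cell_fraction * mutated_copies
    total = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    if total == 0:
        raise ValueError("no copies of the locus in the sample")
    return mutant / total


if __name__ == "__main__":
    print(expected_vaf(1.0, 2))                            # pure, diploid, clonal
    print(expected_vaf(0.5, 2))                            # 50% purity
    print(expected_vaf(0.6, 2, cancer_cell_fraction=0.5))  # subclonal
    print(expected_vaf(1.0, 4))                            # 1 of 4 copies mutated
```

Under these assumptions a heterozygous clonal mutation only reaches 0.5 in a pure, diploid tumour; lower purity, extra copies, or a subclonal mutation all pull the value down.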


I can imagine. Thanks for your input. Can you recommend any tools for this case?

I effectively want to specify a set of mutations where I can determine the location and the mutation structure. Is there any software you can recommend that can simulate errors, ploidy, subclonal effects and heterozygosity within the various populations?

I


I'd probably start with bam-surgeon https://github.com/adamewing/bamsurgeon


It's also important to consider amplification. Amplified data will diverge from the expected ratio much more than unamplified data.

Assuming a diploid locus and randomly distributed reads, the probability of a given A/B allele ratio can be calculated using the binomial distribution.
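A minimal sketch of that binomial calculation, using only the Python standard library. The function names are made up for illustration, and it assumes a fixed depth at the site with each read drawn independently from allele B with probability `p` (0.5 for an unamplified heterozygous diploid site).

```python
from math import comb


def allele_count_pmf(k, depth, p=0.5):
    """Probability of seeing exactly k reads of allele B out of `depth` reads."""
    return comb(depth, k) * p**k * (1 - p) ** (depth - k)


def prob_alt_between(low, high, depth, p=0.5):
    """Probability that the allele B read count falls in [low, high]."""
    return sum(allele_count_pmf(k, depth, p) for k in range(low, high + 1))


if __name__ == "__main__":
    depth = 30
    print(f"P(exactly 15 of {depth}) = {allele_count_pmf(15, depth):.4f}")
    print(f"P(10 to 20 of {depth})   = {prob_alt_between(10, 20, depth):.4f}")
```

Even at 30x, an exact 15/15 split is the single most likely outcome but still happens in only about 14% of heterozygous sites under this model, so some imbalance is expected from sampling alone.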


FWIW, when talking about ploidy, we are talking about copy number (amplification or deletions)


For cancer, perhaps, but my impression is that the OP is trying to simulate normal diploid data, despite his discussion of cancer genomics.


For a start, I am trying to simulate diploid data, where any allelic difference between the two haploid sets is due to somatic mutations. Explicitly, I am removing genetic variation between the haploid sets to simplify the algorithmic problem.

Once this is done, I will then increase the complexity of the data, as mentioned, with errors, subclonal effects and ploidy (ploidy under Chris's definition), and simulated genetic variation.
