Question

What Is The Statistical Framework To Compare Distribution Of Two Samples Of Genomic Sequences

5

14.5 years ago

Jeremy Leipzig 23k

I have two genomic samples from the same source and I want to determine if slight variations in extraction treatment done to them had a significant effect.

Assuming I can align these sequences to a reference, what is the accepted method for judging whether samples A and B are more or less similar to each other than samples A and C, or B and C?

I could count the number of identically mapped positions but I would like to temper that with some measure of sample size and diversity.

sequence • 3.0k views

Updated 14.5 years ago by brentp 24k • written 14.5 years ago by Jeremy Leipzig 23k

1

Are all locations of equal weight when you measure "significant"? Depending on the bias you expect in your extraction or the downstream analysis, you may want to weight your loci depending on whether they fall in (for example) a known exon, a UTR, an intron, an evolutionarily conserved sequence, or a CpG island.

Reply • 14.5 years ago by David Quigley 11k

0

From a machine learning perspective, you may look into sequence composition or other features derived from the nucleotide sequence to perform the comparison (A+T content, GC content, di-nucleotide frequency, tri-nucleotide frequency, etc.).

Reply • 14.5 years ago by Khader Shameer 18k
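
As a rough illustration of the sequence-composition comparison suggested in the comment above, here is a minimal Python sketch. It assumes each sample is available as a plain list of read strings; the function names (`kmer_frequencies`, `gc_content`, `profile_distance`) and the choice of Euclidean distance are illustrative assumptions, not something taken from the thread.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(sequences, k):
    """Relative frequency of every A/C/G/T k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def gc_content(sequences):
    """Fraction of G and C bases among all A/C/G/T bases."""
    freqs = kmer_frequencies(sequences, 1)
    return freqs["G"] + freqs["C"]


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles (an assumed metric)."""
    return math.sqrt(sum((p[key] - q[key]) ** 2 for key in p))


# Hypothetical toy reads, for illustration only.
sample_a = ["ACGTACGT", "GGCCGGCC"]
sample_b = ["ATATATAT", "ACGTACGT"]

di_a = kmer_frequencies(sample_a, 2)  # di-nucleotide profile of sample A
di_b = kmer_frequencies(sample_b, 2)  # di-nucleotide profile of sample B
print(gc_content(sample_a), gc_content(sample_b), profile_distance(di_a, di_b))
```

The same functions work for tri-nucleotides with `k=3`. Whether a given distance is large only becomes clear once it is compared against the distance between replicates of the same treatment.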

0

You are almost certainly going to want biological and/or technical replicates of each of the two extraction methods. Even with the same extraction method, it's fairly typical to see some variation in your results, and you'll need to account for that.

Reply • 14.5 years ago by Wjeck ▴ 490

Ram · Answer 1 · 2010-07-13

2

14.5 years ago

brentp 24k

In addition to the answers above, also see the cuffdiff utility in cufflinks and the cufflinks paper.

Specifically see the supplementary methods: sections 3.2, 3.3 and 5.2 ('Testing for changes in absolute expression')

I don't understand all the stats, but they seem to be well documented and encapsulated in the cufflinks/cuffdiff utility.

Updated 6.3 years ago by Ram 44k • written 14.5 years ago by brentp 24k

score 1 · Answer 2 · 2010-07-13

I think you will first need to define a similarity metric, then establish the error of the methodology itself with respect to this metric. Then you could compare the differences across your samples to the expected error.

For example, if you had three samples A, B and C, with the percentage of identically matching reads as your metric, you could use a 10% cross-validation (repeatedly subselect a random 10% of your data) to establish an empirical distribution of identically matching reads for each of A, B and C. Now you have an expected distribution of errors for each run. You can then compare these distributions and see whether there are statistically significant differences among A, B and C.

As for the 10% figure and what the distributions look like, these are only starting guesses, and the whole line of reasoning may in fact be incorrect, but it could be a good start that gives you a sense of just how consistent the data are.
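
The subsampling idea in this answer could be sketched roughly as follows. This is only one reading of it: it assumes each sample has already been reduced to a list of booleans (True where a read mapped identically to the reference), and the function names, the number of rounds, the fixed seed and the 95% percentile interval are all illustrative choices rather than anything specified in the answer.

```python
import random
import statistics


def subsample_match_rates(match_flags, fraction=0.1, n_rounds=1000, seed=0):
    """Match rate in each of n_rounds random subsamples (drawn without replacement)."""
    rng = random.Random(seed)
    size = max(1, int(len(match_flags) * fraction))
    rates = []
    for _ in range(n_rounds):
        subset = rng.sample(match_flags, size)
        rates.append(sum(subset) / size)
    return rates


def summarize(rates):
    """Mean and an approximate 95% percentile interval of the subsample rates."""
    ordered = sorted(rates)
    lo = ordered[int(0.025 * (len(ordered) - 1))]
    hi = ordered[int(0.975 * (len(ordered) - 1))]
    return statistics.mean(rates), lo, hi


# Hypothetical toy data: 90% of reads match in A, 70% in B.
sample_a = [True] * 900 + [False] * 100
sample_b = [True] * 700 + [False] * 300

mean_a, lo_a, hi_a = summarize(subsample_match_rates(sample_a))
mean_b, lo_b, hi_b = summarize(subsample_match_rates(sample_b))
print(f"A: {mean_a:.3f} [{lo_a:.2f}, {hi_a:.2f}]")
print(f"B: {mean_b:.3f} [{lo_b:.2f}, {hi_b:.2f}]")
```

If the intervals for two samples overlap heavily, the difference between them is within what subsampling noise alone would produce. Note that this captures sampling variability within one run only; it says nothing about run-to-run variation, which replicates would address.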

Ram · Answer 3 · 2010-07-13

Some ideas off the top of my head:

Tile 100kb windows across the genome and count the number of reads that fall into each. Then compare the distributions using a paired t-test. Substitute larger or smaller windows depending on how much data you have.
Another idea - a distance metric based on summing the differences (or ratios) between counts in each of these windows.
Do you expect effects based on GC content? This is common when amplification steps are involved. If so, you could just create a list of each read's GC content percentage, then compare the lists.
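
The windowed-count ideas above might look something like this in Python. It is a sketch under several assumptions: reads are represented only by 0-based start positions on a single chromosome, the function names are made up for illustration, and the distance is taken as the sum of absolute differences between per-window read fractions (one of several reasonable choices). The t statistic is computed by hand to keep the example dependency-free; `scipy.stats.ttest_rel` would give the same statistic together with a p-value.

```python
import math
import statistics


def window_counts(read_starts, chrom_length, window=100_000):
    """Number of reads starting in each fixed-size window along one chromosome."""
    counts = [0] * math.ceil(chrom_length / window)
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // window] += 1
    return counts


def paired_t_statistic(x, y):
    """Paired t statistic on per-window counts; needs non-identical differences."""
    diffs = [a - b for a, b in zip(x, y)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all per-window differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


def window_distance(x, y):
    """Sum of absolute differences between per-window read fractions."""
    total_x, total_y = sum(x), sum(y)
    return sum(abs(a / total_x - b / total_y) for a, b in zip(x, y))


# Hypothetical read start positions on a 300 kb chromosome.
reads_a = [0, 50_000, 150_000, 250_000]
reads_b = [10_000, 120_000, 130_000, 260_000]

counts_a = window_counts(reads_a, 300_000)
counts_b = window_counts(reads_b, 300_000)
print(counts_a, counts_b, window_distance(counts_a, counts_b))
```

One design point worth keeping in mind: the mean of the paired differences equals the difference in total read count divided by the number of windows, so a paired t-test on raw counts mostly reflects sequencing depth, and after scaling both samples to the same depth the statistic is zero by construction. The fraction-based distance does not have that problem, which is why it is normalised here.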