What Is The Statistical Framework To Compare Distribution Of Two Samples Of Genomic Sequences
14.4 years ago

I have two genomic samples from the same source and I want to determine if slight variations in extraction treatment done to them had a significant effect.

Assuming I can align these sequences to a reference, what is the accepted method for judging whether samples A and B are more or less similar to each other than samples A and C, or B and C, are?

I could count the number of identically mapped positions, but I would like to temper that with some measure of sample size and diversity.


Are all locations of equal weight when you measure "significant"? Depending on the bias you expect in your extraction or the downstream analysis, you may want to weight your loci depending on whether they fall in (for example) a known exon, a UTR, an intron, evolutionarily conserved sequence, or a CpG island.
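This weighting idea can be sketched directly. The annotation categories and the weight values below are purely illustrative assumptions, as is the `(annotation, count_a, count_b)` input shape; in practice the weights would come from your own expectations about extraction bias:

```python
# Hypothetical per-annotation weights (assumption: chosen to reflect
# where extraction bias would matter most; tune to your analysis)
WEIGHTS = {
    "exon": 3.0,
    "utr": 2.0,
    "conserved": 2.0,
    "cpg_island": 1.5,
    "intron": 1.0,
    "intergenic": 0.5,
}

def weighted_difference(loci):
    """Sum annotation-weighted per-locus count differences between two samples.

    `loci` is an iterable of (annotation, count_a, count_b) tuples
    (an assumed representation, not any standard format).
    Unknown annotations default to weight 1.0.
    """
    return sum(WEIGHTS.get(ann, 1.0) * abs(a - b) for ann, a, b in loci)
```

A locus-level difference in an exon then counts three times as much as the same difference in an intron, which is the whole point of the weighting.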


From a machine learning perspective, you could look into sequence composition or other nucleotide-sequence-derived features to perform the comparison: A+T or GC content, di-nucleotide frequency, tri-nucleotide frequency, etc.
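A rough sketch of these composition features; the function names and the normalization choice are my own, not from any particular library:

```python
from collections import Counter
from itertools import product

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def kmer_frequencies(seq, k):
    """Normalized k-mer frequency vector (k=2 for di-nucleotides,
    k=3 for tri-nucleotides).

    Uses a fixed ordering over all 4^k k-mers so that vectors from
    different samples are directly comparable.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[km] / total for km in kmers]
```

Feature vectors built this way (one per sample, or one per read) can then be compared with any standard distance or classifier.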


You are almost certainly going to want biological and/or technical replicates of each of the two extraction methods. Even using the same extraction methods it's fairly typical to see some variation in your results, and you'll need to account for that.

14.4 years ago
brentp 24k

In addition to the answers above, also see the cuffdiff utility in cufflinks and the cufflinks paper.

Specifically, see the supplementary methods: sections 3.2, 3.3, and 5.2 ('Testing for changes in absolute expression').

I don't understand all the stats, but they seem to be well documented and encapsulated in the cufflinks/cuffdiff utility.

14.4 years ago

I think you will first need to define a similarity metric, then establish the error of the methodology itself with respect to this metric. Then you could compare the differences across your samples to the expected error.

For example, say you had three samples A, B, and C, and used the percentage of identically matching reads as your metric. You could repeatedly subselect a random 10% of your data for each of A, B, and C to establish an empirical distribution of identically matching reads. Now you have an expected distribution of errors for each run, and you can compare these distributions to see whether there are statistically significant differences between A, B, or C.

As for the choice of 10% and the shape of the distributions, these are just starting guesses, and the whole reasoning may turn out to be incorrect, but it could be a good start that gives you a sense of just how consistent the data is.
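A minimal sketch of this subsampling scheme, under two simplifying assumptions of mine: each sample is reduced to a set of mapped positions, and the "identically matching" metric is taken as the fraction of shared positions over the union:

```python
import random

def identical_fraction(positions_a, positions_b):
    """Toy similarity metric: fraction of mapped positions shared
    between two samples, over their union (an assumption, not the
    only reasonable definition of 'identically matching')."""
    shared = len(positions_a & positions_b)
    union = len(positions_a | positions_b)
    return shared / max(union, 1)

def subsample_distribution(positions, other, frac=0.10, n_iter=1000, seed=0):
    """Empirical distribution of the metric over random subsamples.

    Repeatedly draws a random `frac` of `positions` (10% by default,
    per the starting guess above) and scores it against `other`.
    """
    rng = random.Random(seed)
    positions = list(positions)
    k = max(1, int(frac * len(positions)))
    values = []
    for _ in range(n_iter):
        sub = set(rng.sample(positions, k))
        values.append(identical_fraction(sub, other))
    return values
```

Running this for each pair (A vs B, A vs C, B vs C) gives three empirical distributions whose overlap, or lack of it, indicates whether the observed differences exceed the method's own noise.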

14.4 years ago

Some ideas off the top of my head:

  • Tile 100kb windows across the genome and count the number of reads that fall into each. Then compare the distributions using a paired t-test. Substitute larger or smaller windows depending on how much data you have.

  • Another idea: a distance metric based on summing the differences (or ratios) between counts in each of these windows.

  • Do you expect effects based on GC content? (This is common when amplification steps are involved.) If so, you could create a list of each read's GC-content percentage, then compare the lists.

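The first bullet can be sketched in a few lines. The 100 kb window size follows the suggestion above; the pure-Python paired t statistic (no p-value lookup, which would need a t-distribution table or a stats library) is an illustrative simplification of mine:

```python
import math
from collections import Counter
from statistics import mean, stdev

WINDOW = 100_000  # 100 kb tiles; substitute larger or smaller
                  # windows depending on how much data you have

def window_counts(positions, n_windows):
    """Count mapped reads per fixed-size window, given 0-based
    mapped start positions (an assumed input representation)."""
    counts = Counter(pos // WINDOW for pos in positions)
    return [counts.get(i, 0) for i in range(n_windows)]

def paired_t(a, b):
    """Paired t statistic over per-window count differences.
    Look the value up against a t-distribution with len(a) - 1
    degrees of freedom to get a p-value."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)
    return mean(diffs) / (sd / math.sqrt(n))
```

The second bullet's distance metric falls out of the same per-window vectors, e.g. `sum(abs(x - y) for x, y in zip(a, b))`.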
