It's surprisingly straightforward, but it's difficult to put into a single coherent sentence :) The easiest explanation is to describe the steps used to generate it.
Let's take a small example set of sequences (without quality scores):
CATAAATTCATTTTTTAATAGCTGAGTAGTATTCCATTGTGTAAATGTAC
CGATTCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTT
TCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTTTCC
GTAAGTTTCAGTGTCTCTGGTTTTATGTGGAGTTCCTTAATCCACTT
ATAGGAATGGATCAATTCGCATTCTTCTACATGATAACAGCCAGTTGTGC
GTCAAAGATCAGGTGACCATAGGTGTGTGGATTCATCTCTGGGTCTTCA
GTGGATTCATCTCTGGGTCTTCAATTCTGTTACATTGGTCTACTTGTCTG
ACCATGCAGTTTTGATCACAATTGCTCTGTAGTACAGTTTTAGGTCCGGC
GATGAATCTGCCGATTGCCCTTTCTAATTCGTTGAAGAATTGAGTTGGAA
CTGGCTAGGACTTCAAGTACAATGTTGAATAGGTAGGGCGAGAGTGGA
To make life easier, we'll just consider the first two positions of each read rather than the whole thing. So we start out with 4 vectors of zeros (one vector for each nucleotide): A=[0,0], C=[0,0], G=[0,0], and T=[0,0].
We then read in one read at a time and increment these vectors according to the bases we see. The first read has a C in position 1 and an A in position 2, so we increment the first position of the C vector (resulting in C=[1,0]) and the second position of the A vector (so now A=[0,1]). We continue doing that for each additional read, which results in:
read2: A=[0,1], C=[2,0], G=[0,1], T=[0,0]
read3: A=[0,1], C=[2,1], G=[0,1], T=[1,0]
read4: A=[0,1], C=[2,1], G=[1,1], T=[1,1]
read5: A=[1,1], C=[2,1], G=[1,1], T=[1,2]
read6: A=[1,1], C=[2,1], G=[2,1], T=[1,3]
read7: A=[1,1], C=[2,1], G=[3,1], T=[1,4]
read8: A=[2,1], C=[2,2], G=[3,1], T=[1,4]
read9: A=[2,2], C=[2,2], G=[4,1], T=[1,4]
read10: A=[2,2], C=[3,2], G=[4,1], T=[1,5]
We then divide the results by the number of reads (10 here) and we plot the results. We expect to see flat lines that represent the percentages of A, C, T, and G in the genome. However, there are often biases (particularly at the start of reads), so we perform this analysis to pick that up.
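The counting and dividing steps above can be sketched in a few lines of Python. This is only a minimal illustration of the procedure described in this post, not how any particular QC tool implements it; the function names `per_base_counts` and `per_base_fractions` are made up for the example.

```python
# The ten example reads from above (no quality scores).
reads = [
    "CATAAATTCATTTTTTAATAGCTGAGTAGTATTCCATTGTGTAAATGTAC",
    "CGATTCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTT",
    "TCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTTTCC",
    "GTAAGTTTCAGTGTCTCTGGTTTTATGTGGAGTTCCTTAATCCACTT",
    "ATAGGAATGGATCAATTCGCATTCTTCTACATGATAACAGCCAGTTGTGC",
    "GTCAAAGATCAGGTGACCATAGGTGTGTGGATTCATCTCTGGGTCTTCA",
    "GTGGATTCATCTCTGGGTCTTCAATTCTGTTACATTGGTCTACTTGTCTG",
    "ACCATGCAGTTTTGATCACAATTGCTCTGTAGTACAGTTTTAGGTCCGGC",
    "GATGAATCTGCCGATTGCCCTTTCTAATTCGTTGAAGAATTGAGTTGGAA",
    "CTGGCTAGGACTTCAAGTACAATGTTGAATAGGTAGGGCGAGAGTGGA",
]


def per_base_counts(reads, n_positions):
    """Count how often each base occurs at each of the first n_positions."""
    # One vector of zeros per nucleotide, as in the explanation above.
    counts = {base: [0] * n_positions for base in "ACGT"}
    for read in reads:
        for pos, base in enumerate(read[:n_positions]):
            if base in counts:  # ignore anything that is not A/C/G/T (e.g. N)
                counts[base][pos] += 1
    return counts


def per_base_fractions(counts, n_reads):
    """Divide every count by the number of reads to get fractions."""
    return {base: [c / n_reads for c in column] for base, column in counts.items()}


counts = per_base_counts(reads, 2)
fractions = per_base_fractions(counts, len(reads))

for base in "ACGT":
    print(base, counts[base], fractions[base])
```

The counts it prints should match the read10 line above (A=[2,2], C=[3,2], G=[4,1], T=[1,5]). Note that dividing by the total number of reads only works here because every read covers the first two positions; if you extend this to whole reads of unequal length, you would presumably want to divide each position by the number of reads that actually reach it.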
Thanks, Ryan.
So what is the ideal case? Is it, say, that at position 1 each of the four bases accounts for 25% of the reads?
Well, genomes aren't typically composed of 25% of each base. The ideal situation would be 4 flat lines with reasonable percentages.
Yes, I understand. Thank you. You explained the concept clearly.
DeepS
Sorry for necro'ing this thread, but what exactly constitutes "reasonable percentages" here?
I'm totally new to sequencing. Could you suggest some reference papers or other material from which I can learn the concepts? Your help would be appreciated.
Thanks