Question

Meaning of Per base sequence content in FastQC

6

10.8 years ago

deepthithomaskannan ▴ 410

Hi all,

Can anybody help me understand the meaning of Per base sequence content in a FastQC analysis? The manual defines it as "the proportion of each base position in a file for which each of the four normal DNA bases has been called", but I couldn't understand what that means. If anybody could explain the concept with a simple example, that would be a great help.

Thanks,
DeepS

sequence content fastqc • 29k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by deepthithomaskannan ▴ 410

Ram · Answer 1 · 2014-07-18

19

10.8 years ago

Devon Ryan 105k

It's surprisingly straightforward, but a difficult concept to put into a single coherent sentence :) The easiest explanation is to describe the steps used to generate it.

Let's take a small example set of sequences (without quality scores):

CATAAATTCATTTTTTAATAGCTGAGTAGTATTCCATTGTGTAAATGTAC
CGATTCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTT
TCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTTTCC
GTAAGTTTCAGTGTCTCTGGTTTTATGTGGAGTTCCTTAATCCACTT
ATAGGAATGGATCAATTCGCATTCTTCTACATGATAACAGCCAGTTGTGC
GTCAAAGATCAGGTGACCATAGGTGTGTGGATTCATCTCTGGGTCTTCA
GTGGATTCATCTCTGGGTCTTCAATTCTGTTACATTGGTCTACTTGTCTG
ACCATGCAGTTTTGATCACAATTGCTCTGTAGTACAGTTTTAGGTCCGGC
GATGAATCTGCCGATTGCCCTTTCTAATTCGTTGAAGAATTGAGTTGGAA
CTGGCTAGGACTTCAAGTACAATGTTGAATAGGTAGGGCGAGAGTGGA

To make life easier we'll just consider the first two positions of each read, rather than the whole thing. So we start out with 4 vectors of zeros (one vector for each nucleotide, one entry per position): A=[0,0], C=[0,0], G=[0,0], and T=[0,0].

We then read in one read at a time and increment these vectors according to the bases we see. The first read has a C in position 1 and an A in position 2, so we increment the first position of the C vector (resulting in C=[1,0]) and the second position of the A vector (so now A=[0,1]). We continue doing that for each additional read, which gives the following running totals:

read2: A=[0,1], C=[2,0], G=[0,1], T=[0,0]
read3: A=[0,1], C=[2,1], G=[0,1], T=[1,0]
read4: A=[0,1], C=[2,1], G=[1,1], T=[1,1]
read5: A=[1,1], C=[2,1], G=[1,1], T=[1,2]
read6: A=[1,1], C=[2,1], G=[2,1], T=[1,3]
read7: A=[1,1], C=[2,1], G=[3,1], T=[1,4]
read8: A=[2,1], C=[2,2], G=[3,1], T=[1,4]
read9: A=[2,2], C=[2,2], G=[4,1], T=[1,4]
read10: A=[2,2], C=[3,2], G=[4,1], T=[1,5]
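
The running totals above can be reproduced with a short script. This is only a minimal Python sketch of the counting step, not FastQC's actual implementation; the names `reads`, `N_POSITIONS`, and `counts` are made up for illustration, and skipping non-ACGT calls such as N is an assumption.

```python
# The ten example reads from above (quality scores omitted).
reads = [
    "CATAAATTCATTTTTTAATAGCTGAGTAGTATTCCATTGTGTAAATGTAC",
    "CGATTCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTT",
    "TCTTGATCTTACAGCACAAGCCATTGCTGTTCTATTCAGGAATTTTTCC",
    "GTAAGTTTCAGTGTCTCTGGTTTTATGTGGAGTTCCTTAATCCACTT",
    "ATAGGAATGGATCAATTCGCATTCTTCTACATGATAACAGCCAGTTGTGC",
    "GTCAAAGATCAGGTGACCATAGGTGTGTGGATTCATCTCTGGGTCTTCA",
    "GTGGATTCATCTCTGGGTCTTCAATTCTGTTACATTGGTCTACTTGTCTG",
    "ACCATGCAGTTTTGATCACAATTGCTCTGTAGTACAGTTTTAGGTCCGGC",
    "GATGAATCTGCCGATTGCCCTTTCTAATTCGTTGAAGAATTGAGTTGGAA",
    "CTGGCTAGGACTTCAAGTACAATGTTGAATAGGTAGGGCGAGAGTGGA",
]

N_POSITIONS = 2  # only the first two positions, as in the walkthrough

# One zero-filled count vector per nucleotide.
counts = {base: [0] * N_POSITIONS for base in "ACGT"}

for read in reads:
    for pos, base in enumerate(read[:N_POSITIONS]):
        if base in counts:  # assumption: ignore N or other non-ACGT calls
            counts[base][pos] += 1

print(counts)
# {'A': [2, 2], 'C': [3, 2], 'G': [4, 1], 'T': [1, 5]}
```

Raising `N_POSITIONS` extends the same tally to more positions; since some of these reads are shorter than 50 bases, later positions would be covered by fewer than 10 reads.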

We then divide each count by the number of reads (10 here) and plot the resulting proportions against position in the read. We expect to see roughly flat lines that reflect the overall percentages of A, C, G, and T in the genome. However, there are often biases (particularly at the start of reads), so we perform this analysis to pick them up.
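
As a rough illustration of the division step, here is a small Python sketch that starts from the final counts after read10 and converts them to percentages. The variable names are hypothetical. Dividing by the total number of reads works here because every read covers positions 1 and 2; for later positions with reads of unequal length, one would presumably divide by the number of reads that reach that position instead.

```python
# Final counts after read10, taken from the tally above.
counts = {"A": [2, 2], "C": [3, 2], "G": [4, 1], "T": [1, 5]}
n_reads = 10

# Convert each per-position count to a percentage of reads.
percentages = {
    base: [100 * c / n_reads for c in per_position]
    for base, per_position in counts.items()
}

for base, values in percentages.items():
    print(base, values)
# A [20.0, 20.0]
# C [30.0, 20.0]
# G [40.0, 10.0]
# T [10.0, 50.0]
```

These four lists are what would be drawn as the four lines of the plot, one point per position. With only 10 reads the lines are far from flat, which is just sampling noise in such a tiny example.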

ADD COMMENT • link 10.8 years ago by Devon Ryan 105k

0

Thanks Ryan..

So what is the ideal case? Is it that, say, at position 1 each of the four bases accounts for 25% of the reads?

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by deepthithomaskannan ▴ 410

0

Well, genomes typically aren't composed of 25% of each base. The ideal situation would be 4 flat lines at reasonable percentages.

ADD REPLY • link 10.8 years ago by Devon Ryan 105k

0

Yes, I understand now. Thank you. You explained the concept very clearly.

DeepS

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by deepthithomaskannan ▴ 410

0

Sorry for necro'ing this thread, but what exactly constitutes "reasonable percentages" here?

ADD REPLY • link 4.6 years ago by Dunois ★ 2.9k

0

I'm totally new to sequencing. Could you suggest some reference papers or other material from which I can learn these concepts? Your help will be appreciated.

Thanks

ADD REPLY • link 7.6 years ago by harishkiran.handral • 0

Ram · Answer 2 · 2014-07-22

3

10.8 years ago

Ian 6.1k

There is a handy YouTube video by the author of FastQC that explains the different concepts. Per base sequence content is described about five minutes into the video.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by Ian 6.1k