Hi, I am sorry for asking this silly question, but I am really confused about normalizing this data set. I have three FASTQ files of RNA-seq data: sampleA, sampleB, and sampleC. Suppose the total reads are 5 million in sampleA, 7 million in sampleB, and 8 million in sampleC. I have counted the nucleotide frequencies in the reads that are 18, 19, 20, and 21 bases long in each FASTQ file. I want to plot the frequencies of A, T, G, and C in these reads, but before plotting I need to normalize the frequency matrix.
Sample A
length        A        C        G        T
18       123344   922299   255253   832388
19       642245   454252  7424534   323444
20       133455   545543   543344    93322
21       153335   115543  1633345   213333
Sample B
length        A        C        G        T
18       123344    93399   235553    83382
19       644225   245452  7442534  3311444
20      1133455  2335543   225344    22322
21       112335   112243  1622245   213223
Sample C
length        A        C        G        T
18       122222    22219   233553   343388
19         6445    22452   722534   444212
20        33355   545543   543344    93322
21        22235   225543   223345   223333
So, to normalize, do I add up the total reads of all three samples (5 + 7 + 8 = 20 million reads) and divide every A, T, G, C count in each sample by that, or do I divide each sample's counts by that sample's own total reads (for example, divide the A, T, G, C columns of sampleA by 5 million)? How do I get a proportional estimate of the nucleotide frequencies in each sample? Thank you for your help.
Hi, I suppose the reads are assumed to come from a reference genome/transcriptome. In that case, normalizing by the mapping percentage might be one way to go.
So instead of raw counts of A/C/G/T, you could plot the proportion of A/C/G/T in mapped reads vs. unmapped reads. If the base composition is changing, looking at ratios makes libraries of different sizes comparable.
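If it helps, here is a minimal sketch (in Python with pandas, which is just my choice of tooling, not anything implied by your setup) of the two options you mention: scaling by a sample's own library size vs. taking per-row proportions. The counts are your Sample A numbers, and the 5 million library size is the figure from your post.

import pandas as pd

# Raw base counts per read length for Sample A (numbers from the post)
sample_a = pd.DataFrame(
    {"A": [123344, 642245, 133455, 153335],
     "C": [922299, 454252, 545543, 115543],
     "G": [255253, 7424534, 543344, 1633345],
     "T": [832388, 323444, 93322, 213333]},
    index=[18, 19, 20, 21],  # read length in nt
)

library_size = 5_000_000  # total reads in Sample A

# Option 1: divide by the sample's OWN library size (here scaled to
# counts per million reads). Do this separately for each sample;
# summing all three library sizes would just rescale everything by
# the same constant and would not correct for depth differences.
cpm = sample_a / library_size * 1e6

# Option 2: per-row proportions. Each row sums to 1, so you read off
# the base composition at each read length directly, independent of
# library size. This is the "proportional estimate" you asked about.
proportions = sample_a.div(sample_a.sum(axis=1), axis=0)

print(proportions.round(3))

For plotting base composition, option 2 is usually what you want: once each row is a set of proportions, the three samples are directly comparable regardless of how many reads each one has.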