Dear Biostars,
Currently, I'm working in piRNA expression in different cell lines and I would like to ask you about the way I can proceed with data transformation and normalization. (follow-up question regarding this Q) Now, the main issue is that in order to enrich for piRNAs we performed periodate treatment: "The PO treatment has been shown to be effective in separating piRNAs from other classes of small RNAs and degradation products of longer mRNA transcripts studies" We have treated libraries with ~10 million reads and untreated with ~45 million reads. In order to find piRNAs in our samples, we used SPORTS1.0 with output: matched reads to the genome and matched reads to small-RNA databases, unmatched reads to the genome and matched reads to small-RNA databases. For every database regarding different small RNA (rRNA, tRNA, piRNA, lncRNA ....) we get a file with the particular reads matched to that database.piRNA file example:
t00000406 617 + piR-hsa-3546 3 CTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIII 1
t00000517 445 + piR-hsa-3454 2 CACGTGTTAGGACCCGAAAGA IIIIIIIIIIIIIIIIIIIII 0
t00000519 439 + piR-hsa-3546 0 CGGCTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIIIIII 0
t00000803 402 + piR-hsa-3454 2 CACGTGTTAGGACCCGAAA IIIIIIIIIIIIIIIIIIIII 0
t00000817 394 + piR-hsa-3454 3 ACGTGTTAGGACCCGAAAGA IIIIIIIIIIIIIIIIIIII 0
t00001363 255 + piR-hsa-3546 4 TGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIII 2
t00001363 255 + piR-hsa-29932 0 TGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIII 2
The second column is the number of reads.
The majority of reads multimap to different piRNAs, so I took the sum of reads assigned to each piRNA (both unmatched/matched reads to the genome). for the example above my output file is:
piRNA counts
<chr> <dbl>
1 piR-hsa-29932 255
2 piR-hsa-3454 1241
3 piR-hsa-3546 1311
If the above example is for a treated sample, the untreated sample is something like that:
piRNA counts
<chr> <dbl>
1 piR-hsa-29932 3071
2 piR-hsa-3454 12704
3 piR-hsa-3546 12486
In order to adjust each sample for library differences, I performed this kind of data transformation: If we have 5 treated and 5 untreated biological replicates :
treated_1 6774893
treated_2 4973372
treated_3 7667539
treated_4 41842208
treated_5 18115268
untreated_1 17544293
untreated_2 7106260
untreated_3 5542361
untreated_4 5091629
untreated_5 41335714
We pick the "largest" library: treated_4 41842208 and divide reads by each library in order to get an upscaling factor:
tr %>% mutate(factors=max(tr$reads)/reads)
# A tibble: 10 x 3
sample reads factors
<chr> <dbl> <dbl>
1 treated_1 6774893 6.18
2 treated_2 4973372 8.41
3 treated_3 7667539 5.46
4 treated_4 41842208 1
5 treated_5 18115268 2.31
6 untreated_1 17544293 2.38
7 untreated_2 7106260 5.89
8 untreated_3 5542361 7.55
9 untreated_4 5091629 8.22
10 untreated_5 41335714 1.01
Then we multiply every "feature" of each sample with the corresponding factor. If the 1st example is treated_1 then:
mutate(pirna_counts,counts*tr$factors[1])
# A tibble: 3 x 3
piRNA counts upscaledcounts
<chr> <dbl> <dbl>
1 piR-hsa-29932 255 1575.
2 piR-hsa-3454 1241 7665.
3 piR-hsa-3546 1311 8097.
What kind of normalization I could perform between libraries, what kind of batch effect correction should I use and is there any kind of bias that I'm introducing using this upscaling transformation?
What is the biological question, or comparison, you are interested in? Are there other conditions besides control vs treatment?
Well, the first question is "Are piRNAs expressed in these cell lines? If yes, could we check the relative expression in treated/untreated?". There are also other conditions but I cannot report them now.