Question

Normalization for small non-coding RNA (piRNAs)

6.0 years ago

Konstantinos Yeles ▴ 120

Dear Biostars,

Currently, I'm working on piRNA expression in different cell lines and I would like to ask how I should proceed with data transformation and normalization (a follow-up question regarding this Q).

The main issue is that, in order to enrich for piRNAs, we performed periodate (PO) treatment: "The PO treatment has been shown to be effective in separating piRNAs from other classes of small RNAs and degradation products of longer mRNA transcripts studies". We have treated libraries with ~10 million reads and untreated libraries with ~45 million reads.

To find piRNAs in our samples, we used SPORTS1.0, whose output is: matched reads to the genome and matched reads to small-RNA databases, unmatched reads to the genome and matched reads to small-RNA databases. For every small-RNA database (rRNA, tRNA, piRNA, lncRNA, ...) we get a file with the reads matched to that database. piRNA file example:

t00000406   617     +   piR-hsa-3546    3   CTGTTAACCGAAAGGTTGGTGGT     IIIIIIIIIIIIIIIIIIIIIII 1
t00000517   445     +   piR-hsa-3454    2   CACGTGTTAGGACCCGAAAGA   IIIIIIIIIIIIIIIIIIIII 0
t00000519   439     +   piR-hsa-3546    0   CGGCTGTTAACCGAAAGGTTGGTGGT IIIIIIIIIIIIIIIIIIIIIIIIII 0
t00000803   402 +   piR-hsa-3454    2   CACGTGTTAGGACCCGAAA IIIIIIIIIIIIIIIIIIIII   0
t00000817   394 +   piR-hsa-3454    3   ACGTGTTAGGACCCGAAAGA    IIIIIIIIIIIIIIIIIIII    0
t00001363   255 +   piR-hsa-3546    4   TGTTAACCGAAAGGTTGGTGGT  IIIIIIIIIIIIIIIIIIIIII  2
t00001363   255 +   piR-hsa-29932   0   TGTTAACCGAAAGGTTGGTGGT  IIIIIIIIIIIIIIIIIIIIII  2

The second column is the number of reads.

The majority of reads multimap to different piRNAs, so I took the sum of reads assigned to each piRNA (both unmatched and matched reads to the genome). For the example above, my output file is:

  piRNA         counts
  <chr>          <dbl>
1 piR-hsa-29932    255
2 piR-hsa-3454    1241
3 piR-hsa-3546    1311
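For reference, the summation described above can be reproduced from the seven example rows. This is a minimal base-R sketch; the data frame `hits` and its column names are made up for illustration and are not SPORTS1.0 field names.

```r
# Example rows from the piRNA file above: read ID, read count, matched piRNA.
hits <- data.frame(
  read_id = c("t00000406", "t00000517", "t00000519", "t00000803",
              "t00000817", "t00001363", "t00001363"),
  reads   = c(617, 445, 439, 402, 394, 255, 255),
  piRNA   = c("piR-hsa-3546", "piR-hsa-3454", "piR-hsa-3546", "piR-hsa-3454",
              "piR-hsa-3454", "piR-hsa-3546", "piR-hsa-29932"),
  stringsAsFactors = FALSE
)

# Sum the read counts of every row assigned to each piRNA.
pirna_counts <- aggregate(reads ~ piRNA, data = hits, FUN = sum)
names(pirna_counts)[2] <- "counts"
print(pirna_counts)
```

Note that read t00001363 contributes its 255 reads to two piRNAs, so the per-piRNA totals add up to 2807 while the six distinct sequences account for only 2552 reads.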

If the example above is from a treated sample, the corresponding untreated sample looks something like this:

  piRNA         counts
  <chr>          <dbl>
1 piR-hsa-29932   3071
2 piR-hsa-3454   12704
3 piR-hsa-3546   12486

To adjust each sample for differences in library size, I performed the following data transformation. Suppose we have 5 treated and 5 untreated biological replicates with these read totals:

treated_1    6774893
treated_2    4973372
treated_3    7667539
treated_4   41842208
treated_5   18115268
untreated_1 17544293
untreated_2  7106260
untreated_3  5542361
untreated_4  5091629
untreated_5 41335714

We pick the "largest" library (treated_4, 41842208 reads) and divide its read count by the read count of each library to get an upscaling factor:

tr %>% mutate(factors = max(reads) / reads)
# A tibble: 10 x 3
   sample         reads factors
   <chr>          <dbl>   <dbl>
 1 treated_1    6774893    6.18
 2 treated_2    4973372    8.41
 3 treated_3    7667539    5.46
 4 treated_4   41842208    1   
 5 treated_5   18115268    2.31
 6 untreated_1 17544293    2.38
 7 untreated_2  7106260    5.89
 8 untreated_3  5542361    7.55
 9 untreated_4  5091629    8.22
10 untreated_5 41335714    1.01

Then we multiply every "feature" of each sample with the corresponding factor. If the 1st example is treated_1 then:

mutate(pirna_counts, upscaledcounts = counts * tr$factors[1])
# A tibble: 3 x 3
  piRNA         counts upscaledcounts
  <chr>          <dbl>          <dbl>
1 piR-hsa-29932    255          1575.
2 piR-hsa-3454    1241          7665.
3 piR-hsa-3546    1311          8097.

What kind of normalization could I perform between libraries, what kind of batch effect correction should I use, and am I introducing any bias with this upscaling transformation?
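One way to see what the upscaling does: multiplying by max(reads)/reads is total-count scaling, so the result equals counts per million (CPM) times a single constant shared by all samples. The base-R sketch below checks this on the treated_1 numbers above. The variable names are illustrative, and it assumes the read totals listed above are the library sizes to normalize against (the post does not say whether they are total or mapped reads).

```r
# Read totals per library, as listed in the question.
reads <- c(treated_1 = 6774893, treated_2 = 4973372, treated_3 = 7667539,
           treated_4 = 41842208, treated_5 = 18115268,
           untreated_1 = 17544293, untreated_2 = 7106260,
           untreated_3 = 5542361, untreated_4 = 5091629,
           untreated_5 = 41335714)

# Upscaling factor: largest library divided by each library.
factors <- max(reads) / reads

# piRNA counts of the treated_1 example.
counts <- c("piR-hsa-29932" = 255, "piR-hsa-3454" = 1241, "piR-hsa-3546" = 1311)

upscaled <- counts * factors[["treated_1"]]
cpm      <- counts / reads[["treated_1"]] * 1e6

# The ratio is the same for every feature: max(reads) / 1e6.
print(upscaled / cpm)
```

So the transformation corrects for sequencing depth only. Like CPM, it does not account for composition differences between treated and untreated libraries.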

normalization DEseq2 edgeR limma • 2.4k views

What is the biological question, or comparison, you are interested in? Are there other conditions besides control vs treatment?

6.0 years ago by A. Domingues ★ 2.7k

Well, the first question is "Are piRNAs expressed in these cell lines? If yes, could we check the relative expression in treated/untreated?". There are also other conditions but I cannot report them now.

6.0 years ago by Konstantinos Yeles ▴ 120

6.0 years ago

A. Domingues ★ 2.7k

edit: this should have been a comment. It's late.

But how do we know that these kinds of reads derive from piRNAs?

I would tread carefully in the case of piRNAs derived from protein-coding genes. But you can still, very naively, count reads mapping to annotated genomic features such as rRNA, miRNA, repeats, etc. If a read maps to transposons, has a certain length (usually 25-30 nt depending on the species), starts with a U (T) or has an A at position 10, it is quite likely to be a piRNA. You can always check whether it originates from a cluster, but I am not sure how well those are annotated in human. Also, there is extensive literature on piRNA biology in Drosophila, but don't assume all mechanics will work the same way in other species, because they probably won't, though the broad functionality is conserved.

Ultimately, there is always a degree of uncertainty when studying piRNAs that you have to accept. I urge you to read the literature, in particular on the mouse model system, which might be the closest to human, and see what the common data analysis practices in the field are. Also ask people in your lab about piRNA biology.

6.0 years ago

A. Domingues ★ 2.7k

rRNA, miRNA, tRNA

I forgot to add that based simply on mapping to these annotated features, it is quite likely the reads are not originating from piRNAs.

score 4 · Accepted Answer · 2018-12-03

(this is a bit too long for a comment so adding it as an answer)

Judging also from your post in BioC, it seems like you are interested in a sort of non-standard analysis, or at least one that I am not familiar with. First things first.

Are piRNAs expressed in these cell lines?

A better way to answer this question is to do a RIP of one or more of the core components of the piRNA pathway in humans (Hiwi, Hili and Hiwi2?), followed by small RNA-seq, and see if the profile of the small RNAs matches what is known:

RNA length profile
1U and 10A bias
ping-pong signal

I would use all mapped reads, and not a subset of those reads that match a database. To be fair, you can already do this with the small RNA data you have, but showing that they are indeed bound to a protein of the pathway is much more convincing in a new system.
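As a rough illustration of the first two checks, here is a base-R sketch that computes a read-length profile and the 1U / 10A fractions, weighting each sequence by its read count. It uses the six example sequences from the question purely as input data; with so few sequences the numbers say nothing about whether these are piRNAs. The variable names are illustrative, and the ping-pong signal is left out because it needs genomic coordinates of the mapped reads.

```r
# Distinct sequences and their read counts from the example in the question.
reads <- data.frame(
  sequence = c("CTGTTAACCGAAAGGTTGGTGGT",
               "CACGTGTTAGGACCCGAAAGA",
               "CGGCTGTTAACCGAAAGGTTGGTGGT",
               "CACGTGTTAGGACCCGAAA",
               "ACGTGTTAGGACCCGAAAGA",
               "TGTTAACCGAAAGGTTGGTGGT"),
  count = c(617, 445, 439, 402, 394, 255),
  stringsAsFactors = FALSE
)

# Length profile: total reads per sequence length.
length_profile <- tapply(reads$count, nchar(reads$sequence), sum)

# 1U bias: fraction of reads starting with T (U in RNA).
first_base <- substr(reads$sequence, 1, 1)
frac_1u <- sum(reads$count[first_base == "T"]) / sum(reads$count)

# 10A bias: fraction of reads with an A at position 10.
tenth_base <- substr(reads$sequence, 10, 10)
frac_10a <- sum(reads$count[tenth_base == "A"]) / sum(reads$count)

print(length_profile)
print(c(frac_1u = frac_1u, frac_10a = frac_10a))
```

In a real analysis the input would be all mapped reads of a library, and the profiles of treated and untreated libraries would be compared side by side.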

If yes, could we check the relative expression in treated/untreated?

For this I would do the same piRNA characterization as above, normalizing to mapped reads, and see if the signal is better in the treated libraries. It is not particularly quantitative, but if the treatment really is effective in enriching for piRNAs, it will pass the eyeball test.

If you want to go down the route of using small RNA databases and DESeq2, then count each read only once (otherwise you are creating data that doesn't exist) and use the read counts for a fairly standard DESeq2 analysis. Two things to keep in mind:

Assuming the treatment is really effective, the library composition will be biased, and using only the table of piRNA counts might (or will) violate a few assumptions of DESeq2, especially that the majority of genes will not change with treatment. Using all small RNAs might alleviate this global effect.
This analysis will tell you which piRNAs change with treatment, but I don't know if you can say anything about enrichment of piRNAs in general. Maybe someone with a better knowledge of statistics can chime in on this.
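To make the first point concrete, here is a toy sketch of the median-of-ratios idea behind DESeq2's size factors, re-implemented in base R (it is not the DESeq2 code itself). All counts and feature names are made up: three piRNAs are enriched 4x by treatment and five miRNAs are unchanged.

```r
# Toy counts (made-up numbers): 3 piRNAs up 4x in treated, 5 miRNAs unchanged.
counts <- cbind(
  untreated = c(100, 200,  50, 1000, 500, 300, 800, 400),
  treated   = c(400, 800, 200, 1000, 500, 300, 800, 400)
)
rownames(counts) <- c("piR_a", "piR_b", "piR_c",
                      "miR_a", "miR_b", "miR_c", "miR_d", "miR_e")

# Median-of-ratios size factors: for each sample, the median ratio of its
# counts to the per-feature geometric mean (features with a zero are skipped).
median_ratio_size_factors <- function(m) {
  log_m <- log(m)
  log_geo_mean <- rowMeans(log_m)
  keep <- is.finite(log_geo_mean)
  apply(log_m[keep, , drop = FALSE], 2,
        function(col) exp(median(col - log_geo_mean[keep])))
}

sf_all   <- median_ratio_size_factors(counts)        # all small RNAs
sf_pirna <- median_ratio_size_factors(counts[1:3, ]) # piRNA table only

# Normalizing the piRNA-only table with its own size factors.
norm_pirna_only <- sweep(counts[1:3, ], 2, sf_pirna, "/")

print(sf_all)
print(sf_pirna)
print(norm_pirna_only)
```

In this toy case the size factors computed on all small RNAs stay at 1, so the 4x enrichment remains visible. Computed on the piRNA table alone they absorb the enrichment, and the normalized piRNA counts become identical in both samples.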

So, I would keep it simple and normalize to mapped reads for general small RNA composition / properties analysis, though some people normalize to miRNAs or another class of small non-coding RNAs, and above all use tried and trusted analyses / tools (as suggested in a previous answer). If the results are promising and hold up to some basic predictions and assumptions about piRNAs, explore a bit more.

edit:

Maybe I misunderstood, but does this mean that a read is counted more than once? Each read should be counted only once. If you want to use multi-mapping reads the best options are to randomly select a piRNA or weight it - read matches two piRNAs, each gets 0.5 counts.

No, you didn't misunderstand. I've also posted an informative example in Biostars. I don't want to choose a piRNA randomly because it may be misleading. Using weights is a possibility, but what about a read that matches 5 piRNAs such as these:

piR-51199 TGCCAAACTAAGCAAGGTCACGTGTGA
piR-51200 TGCCAAACTAAGCAAGGTCACGTGTGAA
piR-51201 TGCCAAACTAAGCAAGGTCACGTGTGAAG
piR-51202 TGCCAAACTAAGCAAGGTCACGTGTGAAGA
piR-51203 TGCCAAACTAAGCAAGGTCACGTGTGAAGG

Each one will get 0.2 counts. Is this the correct way to "counter" multi-mapping?

First of all, I would argue that your read corresponds to only one of those piRNAs. It has a defined sequence and length, so it can only be a 100% reciprocal match with one of those 5 sequences:

read TGCCAAACTAAGCAAGGTCACGTGTGAA

piR-51199 TGCCAAACTAAGCAAGGTCACGTGTGA

piR-51200 TGCCAAACTAAGCAAGGTCACGTGTGAA

piR-51201 TGCCAAACTAAGCAAGGTCACGTGTGAAG

piR-51202 TGCCAAACTAAGCAAGGTCACGTGTGAAGA

piR-51203 TGCCAAACTAAGCAAGGTCACGTGTGAAGG

Yes it has a partial match with the others, but not a full match.
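The distinction between a full and a partial match can be checked directly. This is a small base-R sketch using the five sequences above; the variable names are illustrative.

```r
# The five database entries and the read from the example above.
pirnas <- c("piR-51199" = "TGCCAAACTAAGCAAGGTCACGTGTGA",
            "piR-51200" = "TGCCAAACTAAGCAAGGTCACGTGTGAA",
            "piR-51201" = "TGCCAAACTAAGCAAGGTCACGTGTGAAG",
            "piR-51202" = "TGCCAAACTAAGCAAGGTCACGTGTGAAGA",
            "piR-51203" = "TGCCAAACTAAGCAAGGTCACGTGTGAAGG")
read <- "TGCCAAACTAAGCAAGGTCACGTGTGAA"

# Full (reciprocal) match: identical sequence and length.
exact <- names(pirnas)[pirnas == read]

# Partial match: the read is a prefix of the entry, or the entry a prefix of the read.
partial <- names(pirnas)[startsWith(pirnas, read) | startsWith(read, pirnas)]

print(exact)
print(partial)
```

Requiring an exact match assigns the read to a single entry, whereas the prefix-based criterion matches all five.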

Now, if we are talking about multimapping in terms of genomic location, in my opinion there is no "right" way* to solve the issue, since all options involve a compromise between sensitivity and specificity. For most of the things I do I don't care where a piRNA maps, as long as it maps to a location in the genome, so I pick one location randomly. For some specific purposes I need to know the exact mapping location, and if I can't get it, tough luck: I throw away all multimappers. I never used weights because I see little advantage in them, but they are an acceptable compromise.

*However, there is a bad choice: using all mapped locations for a read. In the case you mention, with one read mapping to multiple piRNAs, your data will blow up from one RNA that was actually sequenced to 5. If it maps to 100 different locations, you are suddenly analysing hundreds of reads or RNAs that were not cloned at all.
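The three options can be compared on the seven example rows from the question, where read t00001363 (255 reads) matches two piRNAs. This is a base-R sketch with illustrative names; the random pick depends on the seed, so only its total is meaningful here.

```r
# Example rows from the question: read ID, read count, matched piRNA.
hits <- data.frame(
  read_id = c("t00000406", "t00000517", "t00000519", "t00000803",
              "t00000817", "t00001363", "t00001363"),
  reads   = c(617, 445, 439, 402, 394, 255, 255),
  piRNA   = c("piR-hsa-3546", "piR-hsa-3454", "piR-hsa-3546", "piR-hsa-3454",
              "piR-hsa-3454", "piR-hsa-3546", "piR-hsa-29932"),
  stringsAsFactors = FALSE
)

# Bad choice: count the read once for every piRNA it matches.
all_hits <- tapply(hits$reads, hits$piRNA, sum)

# Weights: split each read's count evenly across the piRNAs it matches.
n_hits <- ave(hits$reads, hits$read_id, FUN = length)
fractional <- tapply(hits$reads / n_hits, hits$piRNA, sum)

# Random pick: keep exactly one matching row per read ID.
set.seed(1)
rows_by_read <- split(seq_len(nrow(hits)), hits$read_id)
pick <- unlist(lapply(rows_by_read,
                      function(idx) idx[sample.int(length(idx), 1)]))
random_pick <- tapply(hits$reads[pick], hits$piRNA[pick], sum)

# Totals: only the first strategy inflates the number of reads.
print(c(all_hits = sum(all_hits),
        fractional = sum(fractional),
        random_pick = sum(random_pick)))
```

Counting every match yields 2807 reads, while the weighted and random strategies both preserve the 2552 reads that were actually sequenced.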

I hope this helps a little.