Dear Biostars, I'm a new PhD student, and my main project is the identification and differential expression analysis of piRNAs in human colorectal cancer (CRC) cell lines using small RNA-seq. I would like to ask about the bioinformatic analysis workflow and which databases/libraries to use for the different small RNA classes. I have found various strategies for identifying piRNAs, and I plan to follow this kind of workflow:
- Use bowtie for sequence alignment to map reads to the genome (hg38 iGenome)
- annotate reads to rRNA library (which database?)
- annotate remaining reads to mature and hairpin libraries (miRNAs) (miRBase)
4a
- annotate remaining reads to tRNA library (GtRNAdb)
- annotate remaining reads to Rfam library (snRNA, snoRNA, lncRNA)
- annotate remaining reads to piRNA database (piRBase or piRNABank)
4b
- annotate remaining reads to tRNA library (GtRNAdb)
- annotate remaining reads to Rfam library (snRNA, snoRNA, lncRNA)
- annotate remaining reads to piRNA cluster database (piRNAclusterdb)
Three publications have shown that piRNAs can also derive from tRNAs: 1) "The biogenesis pathway of tRNA-derived piRNAs in Bombyx germ cells"; 2) "The human Piwi protein Hiwi2 associates with tRNA-derived piRNAs in somatic cells"; 3) "tRNA processing defects induce replication stress and Chk2-dependent disruption of piRNA transcription".
Thus, I should change the order of annotation libraries:
alternative 4a
- annotate remaining reads to Rfam library (snRNA, snoRNA, lncRNA)
- annotate remaining reads to piRNA database (piRBase or piRNABank)
- annotate remaining reads to tRNA library (GtRNAdb)
alternative 4b
- annotate remaining reads to Rfam library (snRNA, snoRNA, lncRNA)
- annotate remaining reads to piRNA cluster database (piRNAclusterdb)
- annotate remaining reads to tRNA library (GtRNAdb)
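The effect of the library order on tRNA-derived piRNAs can be illustrated with a minimal Python sketch. It assumes exact sequence matching against small in-memory sets instead of bowtie alignments, it covers only the tRNA/Rfam/piRNA steps, and the sequences, library contents, and the function name `annotate_sequentially` are all made up for illustration.

```python
def annotate_sequentially(reads, libraries, order):
    """Assign each read to the first library in `order` that contains it.

    Exact string matching stands in for alignment here (an assumption made
    to keep the sketch short); a real workflow would align with mismatches.
    """
    assignments = {}
    for read in reads:
        label = "unannotated"
        for name in order:
            if read in libraries[name]:
                label = name
                break  # the read is consumed by the first matching library
        assignments[read] = label
    return assignments


# Toy, invented sequences: `shared` is present in both the tRNA and piRNA sets,
# mimicking a tRNA-derived piRNA.
shared = "ACGTACGTACGTACGTACGTACGTAC"
pirna_only = "GGATCCGGATCCGGATCCGGATCCGG"
unknown = "TTTTTTTTTTTTTTTTTTTTTT"

libraries = {
    "tRNA": {shared},
    "Rfam": {"TTGCATTGCATTGCATTGCA"},
    "piRNA": {shared, pirna_only},
}
reads = [shared, pirna_only, unknown]

original_order = ["tRNA", "Rfam", "piRNA"]     # order of 4a
alternative_order = ["Rfam", "piRNA", "tRNA"]  # order of alternative 4a

by_original = annotate_sequentially(reads, libraries, original_order)
by_alternative = annotate_sequentially(reads, libraries, alternative_order)

print("original order:   ", by_original[shared])
print("alternative order:", by_alternative[shared])
```

With the original order the shared sequence is absorbed by the tRNA library and never reaches the piRNA step; with the alternative order it is counted as a piRNA. Reads unique to one library are unaffected by the order.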
Also, I am considering re-mapping the reads against a transposon library (RepBase, as found in piRNA databases) in order to filter the results annotated with the piRBase/piRNABank libraries down to a more robust final piRNA dataset.
A) Which database/library should I use for the annotation of rRNA? I have read these posts: cannot find biostar stackexchange Human Non Coding Rrna Sequences For Download SEQanswers, but it is not clear to me which one is the best option for an annotation library. Pardon my naivety.
B) The next step would be to normalize counts. I have read about RPM, RPKM, TPM and other types of normalization, but which one is more robust for small RNA counts, and which one should be used for piRNA clusters? Because of the length of piRNA clusters (~5-60 kb), I think I should use TPM normalization to show the relative abundance in different cell lines. Is this correct?
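To make the difference between the two measures concrete, here is a minimal Python sketch of RPM and TPM. The cluster names, counts, and lengths are invented toy values, and the function names `rpm` and `tpm` are only illustrative.

```python
def rpm(counts):
    """Reads per million: scale each count by the total number of reads."""
    total = sum(counts.values())
    return {name: c * 1e6 / total for name, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    # Reads per kilobase of feature length
    rate = {name: counts[name] / (lengths_bp[name] / 1000) for name in counts}
    scale = sum(rate.values())
    return {name: r * 1e6 / scale for name, r in rate.items()}


# Toy example: two clusters with identical read counts but a 10-fold
# difference in length (values are made up).
counts = {"cluster_A": 5000, "cluster_B": 5000}
lengths_bp = {"cluster_A": 5_000, "cluster_B": 50_000}

rpm_values = rpm(counts)
tpm_values = tpm(counts, lengths_bp)

for name in counts:
    print(f"{name}: RPM={rpm_values[name]:.1f}  TPM={tpm_values[name]:.1f}")
```

In this toy case both clusters have the same RPM, while the shorter cluster gets the higher TPM. That length effect matters for clusters of 5-60 kb, whereas mature piRNAs all have similar lengths, so length normalization changes little for them.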
C) For DE analysis of mature piRNAs, should I follow the workflow of edgeR, DESeq2 or limma-voom?
D) Does this workflow seem robust, or does it need corrections?
Thank you for your time and consideration
Konstantinos
Regarding C):
One thing to keep in mind is that piRNAs are thought to be one of the most abundant classes of small RNA in animals. As another user commented, many of these annotations could actually be degradation products of other RNAs or simply misclassifications. On top of that, there may be additional piRNAs in your dataset that are not annotated.
For DE analysis any of the packages you mentioned should work.
I'm concerned about the misclassification problem, and that's why I'd prefer to use mature piRNAs from piRBase that have been experimentally validated with different methods. As for novel piRNAs, there are some prediction tools that could help to find that kind of additional information. If I find the same sequences in all my samples, could I use them for DE analysis?
RNAcentral would be a good option for reference data.
Do you think it is better to use it as a reference for all libraries or just for rRNA?
There is a problem with piRNA databases regarding the annotation of some piRNA sequences. In this publication, they showed that some sequences classified as piRNAs in piRNABank match other ncRNAs (rRNAs, tRNAs, YRNAs, snRNAs, and snoRNAs). I believe that if I use a database that records the experimental method used to find each piRNA, I could filter it and keep only piRNAs validated with 2 different methods. Fortunately, the updated version of piRBase has this kind of information.
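The filtering idea could look roughly like the Python sketch below. The record layout, the field names (`name`, `methods`), the method labels, and the function `keep_multi_method` are hypothetical and do not reflect the actual piRBase file format, which would need to be parsed into this shape first.

```python
def keep_multi_method(records, min_methods=2):
    """Keep records supported by at least `min_methods` distinct methods."""
    return [r for r in records if len(set(r["methods"])) >= min_methods]


# Toy, invented records; real entries would come from a parsed database dump.
records = [
    {"name": "piR-toy-1", "methods": ["small RNA-seq", "IP"]},
    {"name": "piR-toy-2", "methods": ["small RNA-seq"]},
    {"name": "piR-toy-3", "methods": ["IP", "IP"]},  # same method listed twice
]

validated = keep_multi_method(records)
print([r["name"] for r in validated])
```

Counting distinct methods, rather than the number of supporting entries, avoids keeping a piRNA that was reported several times with the same technique.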
Thank you for your time