Hi all,
As mentioned before, R is a programming language for statistical analysis. Because it is freely available and offers a wide range of statistical tests and plotting options, it is widely used to analyze bioinformatics data.
R in bioinformatics
For example, there are many libraries that can remove contamination, perform quality checks on FASTQ files, analyze next-generation sequencing (NGS) data, calculate gene expression, perform differential gene expression analysis (DESeq or edgeR), and generate heatmaps, histograms, line plots, Venn diagrams and other relevant plots. Similarly, microarray analyses can be done in R: packages such as limma calculate fold-change values after reducing the noise in the data. limma can analyze both microarray and NGS data.
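To make "fold change" concrete, here is a minimal sketch of a log2 fold change between two groups of expression values. It is written in Python rather than R, the function name, the pseudocount and the example values are all assumptions for illustration, and it is not what limma, DESeq or edgeR do internally (those packages normalize the data and fit statistical models first).

```python
import math
from statistics import mean


def log2_fold_change(treated, control, pseudocount=1.0):
    """Log2 ratio of mean expression in treated vs. control samples.

    The pseudocount (an assumption here, not a fixed standard) avoids
    division by zero when a gene has no counts in one group.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Hypothetical expression values for one gene in three replicates per group.
treated = [31, 33, 29]
control = [6, 7, 8]
print(log2_fold_change(treated, control))
```

A positive value means higher expression in the treated group, and a value of 1 corresponds to a doubling.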
There are also many tools written in R that can read files generated by various instruments that can't be read directly as text, such as ab1 or BAM files.
Many researchers use R to test for differences between samples and to calculate p-values. Some of the most widely used tests in bioinformatics are the t-test, the Z-test, ANOVA, tests of normality, and other parametric and non-parametric tests. Machine learning in R is also used to classify and cluster biological data, and many papers use R to build such classifiers.
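As an illustration of what a two-sample t-test computes, here is a minimal sketch of Welch's t statistic and its approximate degrees of freedom. It is written in Python rather than R, and the function name and sample values are hypothetical. In R, `t.test(a, b)` performs the Welch version by default and also reports the p-value, which this sketch leaves out because it needs the t distribution.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical measurements from two groups of samples.
group_a = [5.1, 4.9, 5.6, 5.8, 5.0]
group_b = [4.1, 4.3, 3.9, 4.4, 4.0]
t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

The larger the absolute t statistic for a given number of degrees of freedom, the smaller the p-value.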
Many studies have used R to build mathematical models of how dependent variables trend with independent variables. Using R classification libraries, researchers can also do text mining, which saves a lot of time in manual curation. R is also widely used to calculate pairwise and multiple correlations in order to find relationships between samples.
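The pairwise correlations mentioned above can be sketched as follows. This is a minimal Python illustration of the Pearson coefficient for every pair of samples, not R code. The sample names and values are made up, and in R the same result comes from `cor()`.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(samples):
    """Map each pair of sample names to the correlation of their values."""
    return {
        (p, q): pearson(samples[p], samples[q])
        for p, q in combinations(sorted(samples), 2)
    }


# Hypothetical expression values for four genes in three samples.
samples = {
    "sample1": [2.0, 4.0, 6.0, 8.0],
    "sample2": [1.0, 2.1, 2.9, 4.2],
    "sample3": [9.0, 7.0, 4.0, 1.0],
}
for pair, r in pairwise_correlations(samples).items():
    print(pair, round(r, 3))
```

Values near 1 or -1 indicate a strong linear relationship between two samples, and values near 0 indicate little or none.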
R is also used to create plots for publications. There is also a separate package, built on the R statistical programming language, with which users can perform a wide range of bioinformatics data analyses. Packages that host a variety of tools can help analyze bioinformatics data such as microarray, differential gene expression, SNP, flow cytometry and PCR data. Using an R package, researchers can perform the analyses mentioned above and much more. For example, an R package can analyze NGS or microarray data end to end without much manual intervention.
One of the NCBI resources, Gene Expression Omnibus (GEO), uses R to analyze the microarray data available in its online database. It analyzes the data and maps probes to genes, making it easier for researchers without a bioinformatics background to perform their own analyses.
Many bioinformatics databases can be accessed and downloaded from R. Examples include Ensembl, through biomaRt, and TCGA cancer data, through TCGAbiolinks, along with many other web servers. R is also used to identify motifs in sequences and to perform mutation analysis, including the calculation of allele-specific expression. R can also be used to create HTML pages with built-in APIs that link a database to the front end, which helps in setting up a bioinformatics web server with minimal effort using RStudio and R Shiny. R is also used to analyze data from flow cytometry, PCR and other low-throughput methods. Alignments can be performed in R as well.
R has many more applications in bioinformatics, as almost all bioinformatics data analysis can be done with R packages.
Source of the Blog: https://en.novogene.com/resources/blog/hello-r-world-introduction-to-r/
References:
- Roser, L. G., Agüero, F. & Sánchez, D. O. FastqCleaner: An interactive Bioconductor application for quality-control, filtering and trimming of FASTQ files. BMC Bioinformatics (2019) doi:10.1186/s12859-019-2961-8.
- Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. (2010) doi:10.1186/gb-2010-11-10-r106.
- Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (2009) doi:10.1093/bioinformatics/btp616.
- Trakhtenberg, E. F. et al. Cell types differ in global coordination of splicing and proportion of highly expressed genes. Sci. Rep. (2016) doi:10.1038/srep32249.
- Jha, A., Mehra, M. & Shankar, R. The regulatory epicenter of miRNAs. J. Biosci. 36, 621–638 (2011).
- Jha, A., Panzade, G., Pandey, R. & Shankar, R. A legion of potential regulatory sRNAs exists beyond the typical microRNAs microcosm. Nucleic Acids Res. 43, 8713–24 (2015).
- Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. (2015) doi:10.1093/nar/gkv007.
- Hill, J. T. et al. Poly peak parser: Method and software for identification of unknown indels using sanger sequencing of polymerase chain reaction products. Dev. Dyn. (2014) doi:10.1002/dvdy.24183.
- Ru, Y. et al. The multiMiR R package and database: Integration of microRNA-target interactions along with their disease and drug associations. Nucleic Acids Res. (2014) doi:10.1093/nar/gku631.
- Zhang, J. et al. MiRspongeR: An R/Bioconductor package for the identification and analysis of miRNA sponge interaction networks and modules. BMC Bioinformatics (2019) doi:10.1186/s12859-019-2861-y.
- Edgar, R. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).
- Yao, Z. et al. Discriminative motif analysis of high-throughput dataset. Bioinformatics (2014) doi:10.1093/bioinformatics/btt615.
- Klinke, D. J. & Brundage, K. M. Scalable analysis of flow cytometry data using R/Bioconductor. Cytom. Part A (2009) doi:10.1002/cyto.a.20746.
- Ahmed, M. & Kim, D. R. pcr: An R package for quality assessment, analysis and testing of qPCR data. PeerJ (2018) doi:10.7717/peerj.4473.
- Bodenhofer, U., Bonatesta, E., Horejš-Kainrath, C. & Hochreiter, S. Msa: An R package for multiple sequence alignment. Bioinformatics (2015) doi:10.1093/bioinformatics/btv494.
May I ask what the purpose of this is? This is a plain copy without any formatting from https://en.novogene.com/resources/blog/hello-r-world-introduction-to-r/
Are you affiliated with Novogene? If so, please indicate this and reference the source of your post. If not, this might raise the question of plagiarism. Please clarify, thank you.
I am very sorry for this misunderstanding. It is not plagiarism, and I represent Novogene. I did not identify myself as Novogene because I do not intend to promote anything here for commercial purposes; I just want to share some technical knowledge and learn new things. I'm gathering experience from some BI experts and want to share it here on NGS and bioinformatics.
I will be very careful and pay close attention to anything that might be considered plagiarism. I assure you that all the content is original. I've edited this post to cite the source. My sincere apologies for the confusion. Thank you for pointing this out.
No problem then, thank you for the clarification. I think (and this is just my personal opinion) it is good practice to indicate the source of any external information, which you have now done.
Since I am quite fresh here, thanks for giving me these suggestions.
Because this is a plain copy-paste from the novogene page, it has numbers (pointers to specific references that the source has), but not the references themselves. Please either add those references or remove the numbers, as they're just confusing at the moment. DESeq2 is a thing, so your post saying "DESeq9" adds confusion.
Thank you so much for letting me know. I've removed the reference numbers from the body of the article, as the platform seems unable to format them as superscripts. The citations have also been added at the end of the post. Could you give me some editing suggestions?
Sure. I've been meaning to create a how-to on the super/subscript formatting. For superscript, you can use the <sup> ... </sup> HTML element. Here is an example: a² + b² = c² (raw text: a<sup>2</sup> + b<sup>2</sup> = c<sup>2</sup>)