Dear All,
I have little experience with bioinformatics. At the moment I do not have access to an HPC cluster, which seems to be indispensable for big-data analysis.
Do you know of a publicly available data set that is small enough yet still instructive, so that the analysis can be performed on one's own laptop? I have an HP laptop with 7 CPU cores and 16 GB RAM. Ideally the data set would come from a eukaryotic organism with a reference genome, for training on RNA-seq, ChIP-seq or ATAC-seq analysis.
Besides, is there any such thing as public cloud computing, so that one could analyze big data without many restrictions?
I would highly appreciate any comments.
What are the specifications of your laptop? I perform DNA-, RNA- and ChIP-seq analysis on my laptop, and I have processed entire TCGA datasets. The limitation is whole-genome phasing.
As I noted above, the specs are:
Model: HP Pavilion (7 CPU cores, 16 GB RAM)
I know of a data set containing subsets of many RNA-seq, ATAC-seq and ChIP-seq files (mouse genome), with a total volume of about 1 TB. Do you think it would be feasible to download the files one by one and process them sequentially on my laptop? How long would it take my HP Pavilion to process each such file? And what do you mean by whole-genome phasing?
Many thanks.
Whole-genome phasing is a compute-intensive task: it involves haplotype-resolving the variants called in a particular sample. If you are not sure what it is, do not worry about it. Note that the gzipped FASTQ files can be 30 GB each.
RNA-seq analysis can be very quick if you use a pseudo-aligner such as Salmon. Exome-seq can take quite some time, e.g. 4-12 hours to align reads and call variants in a single sample. ChIP- and ATAC-seq runtimes will depend on the mark targeted and how extensive the binding (and thus the number of reads) was.
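To make the Salmon suggestion concrete, a single-sample quantification might look like the sketch below. This is a hypothetical dry run: the index name, sample file names and thread count are placeholders (not from this thread), and the command is echoed rather than executed, so nothing needs to be installed; drop the echo to actually run it.

```shell
# Hypothetical sketch: quantify one paired-end RNA-seq sample with Salmon.
# All names (mouse_index, sample1_*.fastq.gz) are placeholders.
INDEX=mouse_index      # a pre-built Salmon transcriptome index
THREADS=6              # leave one core free on a 7-core laptop

# Build the command, then echo it (dry run) instead of executing it.
CMD="salmon quant -i $INDEX -l A -1 sample1_R1.fastq.gz -2 sample1_R2.fastq.gz -p $THREADS -o quants/sample1"
echo "$CMD"
```

`-l A` lets Salmon infer the library type automatically; the output directory `quants/sample1` holds the abundance estimates, which are small compared to the input FASTQ files.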
Will upgrading the working memory from 16 GB to 32 GB RAM make a big difference in time efficiency? Is it worth it? Concerning the Salmon tool, do you mean using Salmon instead of DESeq2? I ask because I have not encountered Salmon in the RNA-seq tutorials I have looked at.
Salmon is a transcript quantifier: it produces transcriptome abundance estimates. These are typically aggregated to the gene level with tximport (a Bioconductor package), followed by differential analysis with DESeq2 (or any other framework you prefer); see here. So Salmon does not replace DESeq2, it comes before it in the workflow. Salmon is not resource-hungry and is very fast.

The limiting factor will be the alignment of the ChIP/ATAC-seq data. 16 GB should be enough; CPU is the critical factor, as there is a roughly linear relationship between time saved and the number of CPUs used (at least in the range of 1 to about 16 cores; beyond that, the IO bottleneck kicks in more and more). Memory is basically only needed to load the alignment index, to hold the reads currently being processed by BWA (bowtie2 needs less memory in my experience, maybe give it a try), and in part to sort the alignment files.

Still, do not waste too much private money: if you have limited computational resources, you simply have to wait longer for the job to finish. Test your scripts on small datasets and, once they are stable, just start them and wait. I am not sure you need 32 GB RAM for most things; aligning Exome-seq data to the human genome takes just 5-6 GB RAM. It may be better to think about parallelising the alignment process: A: Using parallel with Mem
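As a rough illustration of a multi-threaded alignment step on a laptop, the sketch below echoes a BWA-MEM command piped into samtools sort (a dry run; the reference and FASTQ names are placeholders, not from this thread). The point is the thread and memory flags: `-t` drives BWA's own threading, and `-m` caps samtools' per-sorting-thread memory so the total stays well below 16 GB.

```shell
# Hypothetical sketch: align paired-end ChIP-seq reads and sort the output.
# genome.fa and chip_*.fastq.gz are placeholder names.
REF=genome.fa

# Echoed as a dry run; remove the echo/quotes to actually execute the pipe.
CMD="bwa mem -t 6 $REF chip_R1.fastq.gz chip_R2.fastq.gz | samtools sort -@ 2 -m 768M -o chip.sorted.bam -"
echo "$CMD"
```

With 6 alignment threads and 2 sorting threads at 768 MB each, memory stays in the few-GB range discussed above, so the 7-core/16 GB laptop is the bottleneck in time, not feasibility.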
Salmon performs transcript abundance estimation. The output of Salmon can then be input to DESeq2 for normalisation and differential expression analysis. Another such program is kallisto.
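Regarding the earlier question about processing the ~1 TB collection one file at a time: the loop below sketches that sequential pattern (download, quantify, delete, repeat). It is a dry run with hypothetical sample names; the quantification commands are echoed, and the download step is left as a comment since no source was specified in the thread.

```shell
# Hypothetical sketch: process samples one by one so disk usage stays at
# roughly one sample's worth of FASTQ files instead of the full ~1 TB.
PROCESSED=""
for SAMPLE in sampleA sampleB; do   # placeholder sample names
  # 1) download this sample's FASTQ files (downloader/source not specified)
  # 2) quantify with Salmon (echoed here as a dry run)
  echo "salmon quant -i mouse_index -l A -1 ${SAMPLE}_R1.fastq.gz -2 ${SAMPLE}_R2.fastq.gz -p 6 -o quants/$SAMPLE"
  # 3) delete the large FASTQs before fetching the next sample, keeping
  #    only the small quants/$SAMPLE output directory
  PROCESSED="$PROCESSED $SAMPLE"
done
```

The accumulated `quants/<sample>` directories are then what tximport reads in R before handing gene-level counts to DESeq2, as described above.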