Hello good peeps,
I recently started analyzing RNA-seq data, mostly teaching myself from articles and online tutorials, and unfortunately I do not have access to a high-performance computing cluster. I am working on a machine with a 12th Gen Intel i9 processor and 32 GB of DDR5 RAM.
I am using an Ubuntu VM and the terminal for the analysis. I could do the whole analysis on Galaxy, but I thought it would be better to learn how to do it with scripts. For the past few days I have been stuck at the alignment step: I am trying to create a STAR index of hg38. Here is my command.
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /home/oliver/calc/rawfiles/annot --genomeFastaFiles /home/oliver/calc/rawfiles/hg38.fa --sjdbGTFfile /home/oliver/calc/rawfiles/hg38.refGene.gtf --sjdbOverhang 100
The process was killed at the following step:
Jun 17 13:53:29 ... loading chunks from disk, packing SA...
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/usr/bin/STAR: line 7: 68768 Aborted (core dumped) "${cmd}" "$@"
And in the last attempt, the command was this:
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir /home/oliver/calc/rawfiles/annot --genomeFastaFiles /home/oliver/calc/rawfiles/hg38.fa --sjdbGTFfile /home/oliver/calc/rawfiles/hg38.refGene.gtf --sjdbOverhang 100 --limitGenomeGenerateRAM 18000000000
In this one, the run got as far as writing the SA_47 file in the mentioned directory (genomeDir = annot), but then the process crashed again, and the previous SA files were gone.
Could anyone please help with this issue and suggest how to solve it? Is there anything wrong with the command, or is it just a matter of insufficient RAM? In the VM, the memory looks like this.
oliver@oliver-VirtualBox:~$ free
              total        used        free      shared  buff/cache   available
Mem:       24414620      996720    22690664       30064      727236    23030164
Swap:       2097148       57180     2039968
Sorry for the long post. Hoping for helpful responses. Thanks.
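To put the numbers above side by side, here is a small sketch that converts the kB figures from `free` into GiB and compares them with the RAM limits involved. The 31000000000-byte figure is, as far as I know, STAR's documented default for --limitGenomeGenerateRAM; treat it as an assumption and check the manual for your version. The helper name `kb_to_gib` is made up for this example.

```shell
# Convert a kB figure (as printed by `free`) to whole GiB, rounding down.
kb_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

vm_total_kb=24414620          # "Mem: total" from the free output above
swap_total_kb=2097148         # "Swap: total" from the free output above
user_limit_bytes=18000000000  # --limitGenomeGenerateRAM used in the second attempt
star_default_bytes=31000000000  # assumed STAR default; verify in your manual

echo "VM RAM:        $(kb_to_gib "$vm_total_kb") GiB"
echo "VM swap:       $(kb_to_gib "$swap_total_kb") GiB"
echo "User limit:    $(( user_limit_bytes / 1024 / 1024 / 1024 )) GiB"
echo "STAR default:  $(( star_default_bytes / 1024 / 1024 / 1024 )) GiB"
```

If the default is what I think it is, the first run was allowed to ask for more memory than the VM has at all, which would fit the std::bad_alloc.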
I think it is just the memory, and running this in a VM does not help, because some of the RAM has to stay with the host system. Sorting the suffix array in STAR is a very memory-intensive step and may be futile with your amount of RAM. You could likely still run an alignment if you can get a pre-built genome index from somewhere.
Here is a link: https://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/Human/ However, I am not sure whether the indices are compatible with the latest version.
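If you still want to try building the index yourself, STAR has a --genomeSAsparseD option that makes the suffix array sparser, trading mapping speed for lower RAM use. The sketch below only prints the command (a dry run); the value 3 and the 18 GB limit are illustrative, and I cannot promise that the build fits in your VM even with them.

```shell
# Dry run: assemble the genomeGenerate command and print it; nothing is executed.
set -- STAR --runThreadN 6 --runMode genomeGenerate \
    --genomeDir /home/oliver/calc/rawfiles/annot \
    --genomeFastaFiles /home/oliver/calc/rawfiles/hg38.fa \
    --sjdbGTFfile /home/oliver/calc/rawfiles/hg38.refGene.gtf \
    --sjdbOverhang 100 \
    --genomeSAsparseD 3 \
    --limitGenomeGenerateRAM 18000000000

star_cmd="$*"
echo "$star_cmd"

# To actually run it, execute the printed command, or replace the two
# lines above with:  "$@"
```

Start from an empty genomeDir, so that leftovers from a crashed run are not mistaken for a finished index.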
Hello Michael, thanks for taking the time to reply and for your suggestions. I have a few questions,
Again, thanks a lot.
It might be worth trying to build your index with HISAT2, but it also needs a lot of memory for generating the index. There was a recent post on that, and there are some options to reduce memory requirements. If it works, it will most likely only be without the splice site and exon annotation. Another option is to use Salmon or Kallisto; these tools are better suited to consumer hardware.
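A minimal sketch of the annotation-free HISAT2 build, i.e. hisat2-build given only the FASTA and no splice site or exon files. The output prefix hg38_hisat2/hg38 is a made-up name, and the command is only printed, not run.

```shell
# Dry run: plain hisat2-build without splice site or exon annotation.
# Arguments are: <reference FASTA> <index basename>.
set -- hisat2-build -p 6 \
    /home/oliver/calc/rawfiles/hg38.fa \
    hg38_hisat2/hg38   # hypothetical output prefix; the directory must exist

hisat2_cmd="$*"
echo "$hisat2_cmd"

# To actually run it:  mkdir -p hg38_hisat2 && "$@"
```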
For downloading, try to download the full https://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/Human/GRCh38_Ensembl99_sparseD3_sjdbOverhang99/ directory with wget -r
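Plain wget -r recreates the whole host and path hierarchy locally, so the index files end up several directories deep. A sketch of a tidier call is below (printed only, not executed). The --cut-dirs value of 7 is my count of the path components before the index directory in this particular URL; recount it if you use another one.

```shell
url="https://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/Human/GRCh38_Ensembl99_sparseD3_sjdbOverhang99/"

# -r             recursive download
# -np            do not ascend to the parent directory
# -nH            do not create a directory named after the host
# --cut-dirs=7   drop shares/.../Human, keeping only the index directory
# -R             skip the auto-generated directory listing pages
set -- wget -r -np -nH --cut-dirs=7 -R "index.html*" "$url"

wget_cmd="$*"
echo "$wget_cmd"

# To actually download:  "$@"
```

With these options the files should land in ./GRCh38_Ensembl99_sparseD3_sjdbOverhang99/, and that directory is what --genomeDir has to point to.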
I tried to run the STAR alignment with the pre-built index from the link that you shared. The following was the command.
STAR --runThreadN 4 --genomeDir Index --readFilesIn /mnt/d/RNA_seq/Files/'Control_R1.gz' /mnt/d/RNA_seq/Files/'Control_R2.gz' --readFilesCommand zcat --outFileNamePrefix alingments/trial_1 --outSAMtype BAM Unsorted
But it failed saying, 'EXITING because of FATAL error, could not open file Index/chrName.txt SOLUTION: re-generate genome files with STAR --runMode genomeGenerate'
So, probably as you mentioned, it is not compatible anymore?
But, thanks anyway.
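A note on that error: "could not open file Index/chrName.txt" reads to me like a path or incomplete-download problem rather than a version mismatch, since STAR is simply not finding the file in the directory given to --genomeDir. As far as I know, an incompatible index produces a different message that mentions the genome version. Below is a quick check for the core index files; the function name `check_star_index` is made up, and the file list is the set I would expect in a STAR index, not something taken from this thread.

```shell
# Report which core STAR index files are missing or empty in a directory.
# Returns 0 if all are present, non-zero otherwise.
check_star_index() {
    dir=$1
    missing=0
    for f in chrName.txt chrLength.txt chrStart.txt Genome SA SAindex genomeParameters.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# "Index" is the relative path from the alignment command above; it is
# resolved against the directory you run the command from.
if check_star_index Index; then
    echo "Index looks complete"
else
    echo "Index is incomplete or the path is wrong"
fi
```

If everything is reported missing, look for the files in a nested subdirectory created by wget -r and point --genomeDir there.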
If all you need is gene-level quantification, you can use Salmon instead, which uses considerably less memory and provides more accurate quantifications.
Pair that with working either on a native Linux distro or via WSL2. VMs are just another unnecessary layer of complexity.
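For reference, a rough sketch of the two Salmon steps, printed as a dry run. The file transcripts.fa (a transcriptome FASTA matching your annotation), the index name salmon_index and the output directory quant/Control are placeholders I made up; the read paths are the ones from the STAR command earlier in the thread.

```shell
# Step 1 (dry run): build a Salmon index from a transcriptome FASTA.
set -- salmon index -t transcripts.fa -i salmon_index -p 4
salmon_index_cmd="$*"
echo "$salmon_index_cmd"

# Step 2 (dry run): quantify one paired-end sample.
# -l A lets Salmon infer the library type; gzipped FASTQ is read directly.
set -- salmon quant -i salmon_index -l A \
    -1 /mnt/d/RNA_seq/Files/Control_R1.gz \
    -2 /mnt/d/RNA_seq/Files/Control_R2.gz \
    --validateMappings -p 4 -o quant/Control
salmon_quant_cmd="$*"
echo "$salmon_quant_cmd"
```

The resulting quant.sf is at the transcript level; summarizing to genes is commonly done afterwards, for example with tximport in R.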
ATpoint and rpolicastro, thanks for your input. I get your point, and next I will try running this with Salmon on WSL2 (skipping the VM for now). Thanks.