This question was previously posted here; I am reposting it for a wider reach.
I am trying to index my de novo-assembled plant genome with the STAR
aligner. The assembly file contains 2,976,459 contigs with an N50 of 1,293 kb.
The following command was used:
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir /path/ \
--genomeFastaFiles /path/*.fa \
--genomeSAindexNbases 14 \
--genomeSAsparseD 2
The error encountered was:
EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome
SOLUTION: please specify --limitGenomeGenerateRAM not less than 2080695648522 and make that much RAM available
System specs: 8-core CPU and 244 GB RAM.
One of the suggestions pointed to the contig count as the potential culprit. I don't have a genome assembly for any related species in the same genus. If a reference genome of a closely related species were available, scaffolding could be done with RagTag, but no such data exist. So what can be done here? Can anyone suggest other memory-efficient tools for indexing?
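One thing I found in the STAR manual that may be worth trying before switching tools: for assemblies with a very large number of references, it recommends lowering --genomeChrBinNbits to roughly min(18, log2(GenomeLength/NumberOfReferences)). A sketch of that calculation using the numbers from this thread (the resulting bin size and the rerun command are a suggestion, not a verified fix):

```shell
# Recompute --genomeChrBinNbits per the STAR manual's recommendation for
# assemblies with many references:
#   genomeChrBinNbits = min(18, log2(GenomeLength / NumberOfReferences))
GENOME_LEN=2412294157   # total assembled bases (from this thread)
N_CONTIGS=2976459       # contig count (from this thread)
BIN_NBITS=$(python3 -c "import math; print(min(18, int(math.log2($GENOME_LEN/$N_CONTIGS))))")
echo "genomeChrBinNbits = $BIN_NBITS"
# Then rerun genomeGenerate with the smaller bin size, e.g.:
#   STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /path/ \
#        --genomeFastaFiles /path/*.fa --genomeSAindexNbases 14 \
#        --genomeSAsparseD 2 --genomeChrBinNbits $BIN_NBITS
```

For this assembly the formula gives a bin size of 9, far below STAR's default of 18, which should cut the per-reference memory overhead substantially.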
Does that add up to ~2.6 Gb, or do you know the genome should be about that size? Perhaps you have lots of redundant contigs and the actual data size is much more than 2.6 Gb. You may want to work on cleaning that up.
I think GenoMax may well be right; there is something fishy about your stats.
Please use conda to install bbmap (from memory), then run this on your contigs and post the result here:
stats.sh x.fasta
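For completeness, the install-and-run steps would look like this (assuming the bioconda channel; the FASTA filename is a placeholder):

```shell
# Install BBMap from bioconda, then summarize the assembly with its stats tool.
conda install -c bioconda bbmap
stats.sh in=contigs.fa   # reports contig count, N50/L50, GC, and size distribution
```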
colindaven, thank you for this valuable suggestion. I will post the results. Meanwhile, I would like to share the assembly statistics generated by the CLC Genomics Workbench: https://drive.google.com/file/d/1MoDREyouaEO9Wj6TNXOhUoqhcy_zeD7v/view?usp=sharing. Can you go through it? I would also like to give you the Illumina demultiplexing stats. Of the total sequencing output, the sample read pairs numbered 606 million, and 304 million were undetermined. Of the total sampled reads, 82.09% were one-mismatch reads. Could this have led to an assembly with too many contigs?
Did you mean the total number of bases? That is 2,412,294,157. How do I confirm that the actual data is much more, and how do I go about cleaning it up?
Have you made sure that there are no redundant contigs at the sequence level (duplicates, contigs contained within others, etc.), e.g. with a tool like cd-hit?
No, I haven't checked for duplicates. I went through the cd-hit pipeline and saw cd-hit-dup. Can you please explain how this is going to help?
Incidentally, my question is: with the existing assembly (2,976,459 contigs), will STAR still require more RAM than the system has, even if I change the parameters? Would I be able to index with HISAT2 using less RAM? My ultimate goal is to run BRAKER with combined protein and RNA-seq evidence.
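For reference, the HISAT2 indexing command I would try looks like this (file names are placeholders; hisat2-build's FM index generally needs far less RAM than STAR's suffix-array-based index):

```shell
# Build a HISAT2 index from the assembly; the output files share the given prefix.
hisat2-build -p 8 assembly.fa assembly_index
```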
Doing QC on a newly built assembly is a must. If you have a lot of redundancy, you will have a mess on your hands downstream if you try to take it through as is.
The simplest advantage of removing redundancy would be a decrease in assembly size and, with it, the number of contigs. That should help with the memory requirements for indexing.
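A minimal sketch of such a redundancy pass (filenames and the 95% identity threshold are illustrative; cd-hit-est must be installed separately):

```shell
# Cluster contigs at 95% nucleotide identity and keep one representative per
# cluster; -n 10 is the word size cd-hit recommends for -c 0.95, -M 0 removes
# the memory cap, and -T 8 uses eight threads:
#   cd-hit-est -i contigs.fa -o contigs.nr.fa -c 0.95 -n 10 -M 0 -T 8

# A quick exact-duplicate count needs no extra tools: flatten each FASTA
# record to one sequence line, then count sequences occurring more than once.
dup_count() {
  awk '/^>/{if (seq) print seq; seq=""} !/^>/{seq=seq $0} END{if (seq) print seq}' "$1" \
    | sort | uniq -d | wc -l | tr -d ' '
}
```

`dup_count contigs.fa` prints the number of distinct sequences that appear more than once; anything above zero confirms exact redundancy before you even consider near-duplicates.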
Okay, GenoMax, thanks for pointing that out. I will use the tool you've suggested (cd-hit) to remove any duplicates and then rerun the indexing. Meanwhile, just curious: what is the source of this redundancy? A poor assembly process, poor DNA quality, or failure to resolve heterozygosity?