Indexing a 2.6GB Plant Genome Using STAR (genomeGenerate Mode) Terminated Due to RAM Limit — Can the Genome Be Split for Indexing, or Are There Other Solutions?
1 day ago
Vijith ▴ 60

This question was previously posted here; I am reposting it for a wider reach.

I am trying to index a de novo assembled plant genome using the STAR aligner. The assembly file contains 2,976,459 contigs with an N50 of 1,293 kb.

The following command was used:

STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir /path/ \
--genomeFastaFiles /path/*.fa \
--genomeSAindexNbases 14 \
--genomeSAsparseD 2

The error encountered was:

EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome

SOLUTION: please specify --limitGenomeGenerateRAM not less than 2080695648522 and make that much RAM available

System specifications: CPU with 8 cores and 244 GB RAM.

One of the suggestions in that thread pointed to the contig count as the likely culprit. I don't have a genome assembly for any related species in the same genus. If a reference genome of a closely related species were available, scaffolding could be done with RagTag, but no such data are available. So, what can be done here? Can anyone suggest other memory-efficient tools for indexing?

star RNA-seq bam genome indexing • 428 views
ADD COMMENT

2,976,459 contigs

Does that add up to 2.6 Gb, or is 2.6 Gb just what you expect the genome size to be? Perhaps you have lots of redundant contigs and the actual data size is much larger than 2.6 Gb. You may want to work on cleaning that up.
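
For a quick check (assuming seqkit is installed; assembly.fa stands in for your file), something like this reports the actual totals:

# prints sequence count, total length, N50, etc.
seqkit stats -a assembly.fa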

ADD REPLY

Do you mean the total number of bases? That is 2,412,294,157. How can I confirm that the actual data size is much larger, and how do I go about cleaning it up?

ADD REPLY

Have you made sure that there are no redundant contigs at the sequence level (duplicates, contigs contained within others, etc.), e.g. with a tool like CD-HIT?
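
For example, a minimal cd-hit-est run to collapse near-identical contigs could look like this (a sketch only; the file names are placeholders and the 95% identity threshold is illustrative, with word size 10 matching that threshold):

# cluster contigs at ~95% identity; -M 0 lifts the memory cap, -T 8 uses 8 threads
cd-hit-est -i assembly.fa -o assembly_nr.fa -c 0.95 -n 10 -M 0 -T 8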

ADD REPLY

No, I haven't checked for duplicates. I went through the CD-HIT documentation and saw cd-hit-dup. Can you please explain how this is going to help?

ADD REPLY

Incidentally, my question is: with the existing assembly file (2,976,459 contigs), will STAR still require more RAM than the system's capacity, even if I change the parameters? Will I be able to index with HISAT2 using less RAM? My ultimate goal is to run BRAKER with protein and RNA-seq data combined.

ADD REPLY

Doing QC on a newly built assembly is a must. If you have a lot of redundancy, you will have a mess on your hands downstream if you try to take it through as is.

The simplest advantage of removing redundancy would be a decrease in the size of the assembly and, with it, the number of contigs. That should help with the memory requirements for indexing.
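
If the cleanup brings the contig count down, it may also be worth applying the scaling the STAR manual recommends for genomes with a very large number of references, --genomeChrBinNbits = min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). A rough sketch (the paths are placeholders and the bin value should be recomputed from the cleaned assembly; ~2.4 Gb over millions of contigs works out to roughly 10):

# many small contigs: lowering --genomeChrBinNbits cuts genomeGenerate RAM use
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir /path/to/index \
--genomeFastaFiles assembly_nr.fa \
--genomeSAindexNbases 14 \
--genomeSAsparseD 2 \
--genomeChrBinNbits 10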

ADD REPLY

Okay, GenoMax, thanks for pointing that out. I will use the tool you suggested (CD-HIT) to remove duplicates, if any, and then run the indexing. Meanwhile, just out of curiosity, what is the source of this redundancy: a poor assembly process, poor DNA quality, or a failure to resolve heterozygosity?

ADD REPLY
10 hours ago

If you have 3 million contigs, it is not going to be possible to use STAR or any other aligner; the assembly is just too fragmented. I once had a plant genome with 5 million contigs, and it was utterly useless.

You could try using a related reference and a tool like RagTag to _attempt_ to build some sort of useful scaffolds/assembly.
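
If a usable reference turns up, a basic RagTag scaffolding run could look like this (a sketch; related_reference.fa and assembly.fa are placeholders):

# order and orient contigs against the related reference; results go to ragtag_output/
ragtag.py scaffold related_reference.fa assembly.fa -o ragtag_output -t 8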

But seriously, why do people try to create assemblies with short reads these days? Sure, it might be cheap, but what do you actually learn? Long reads are the way to go, as you can see in just about any paper in the current literature.

ADD COMMENT
