Indexing a 2.6GB Plant Genome Using STAR (genomeGenerate Mode) Terminated Due to RAM Limit — Can the Genome Be Split for Indexing, or Are There Other Solutions?
1
0
Entering edit mode
6 weeks ago
Vijith ▴ 90

This was previously posted here; I am reposting the question for a wider reach.

I am trying to index my de novo assembled plant genome using the STAR aligner. The assembly contains 2,976,459 contigs with an N50 of 1,293 kb.

The following command was used:

STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir /path/ \
--genomeFastaFiles /path/*.fa \
--genomeSAindexNbases 14 \
--genomeSAsparseD 2

The error encountered was:

EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=31000000000is too small for your genome

SOLUTION: please specify --limitGenomeGenerateRAM not less than 2080695648522 and make that much RAM available

System specifications: CPU with 8 cores and 244 GB RAM.

One of the suggestions pointed to the contig count as the likely culprit. I don't have a genome assembly for any related species in the same genus; if a reference genome of a closely related species were available, scaffolding could be done with RagTag, but no such data exist. So, what can be done here? Can anyone suggest other, more memory-efficient tools for indexing?

star RNA-seq bam genome indexing • 3.3k views
ADD COMMENT
1
Entering edit mode

2,976,459 contigs

Does that add up to 2.6 Gb, or do you just know the genome should be about that size? Perhaps you have lots of redundant contigs etc. and the actual data size is much larger than 2.6 Gb. You may want to work on cleaning that up.

ADD REPLY
2
Entering edit mode

I think GenoMax may well be right; there is something fishy about your stats.

Please use conda to install BBMap (command written from memory):

conda install -c conda-forge -c bioconda bbmap

then run this on your contigs, and post the result here:

stats.sh x.fasta

ADD REPLY
0
Entering edit mode

colindaven, thank you for this valuable suggestion. I will post the results. Meanwhile, I would like to share with you the assembly statistics generated by the CLC Genomics Workbench: https://drive.google.com/file/d/1MoDREyouaEO9Wj6TNXOhUoqhcy_zeD7v/view?usp=sharing. Can you go through them? I would also like to give you the Illumina demultiplexing stats: of the total sequencing output, there were 606 million sample read-pairs, and 304 million read-pairs were undetermined. Of the total sampled reads, 82.09% were one-mismatch reads. Could this have led to an assembly with too many contigs?

ADD REPLY
0
Entering edit mode

Did you mean the total number of bases, which is 2,412,294,157? How can I confirm that the actual data size is much larger, and how do I go about cleaning it up?

ADD REPLY
1
Entering edit mode

Have you made sure that there are no redundant contigs at the sequence level (duplicates, contigs contained within others, etc.), e.g. with a tool like CD-HIT?
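
A rough illustration (mine, not from the thread) of what redundancy removal with cd-hit-est could look like; the file names and the 95% identity threshold are placeholders to adjust:

# collapse contigs that are >=95% identical (including contigs contained within longer ones) into one representative each
cd-hit-est -i contigs.fa -o contigs_nr.fa -c 0.95 -n 10 -M 0 -T 8 -d 0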

ADD REPLY
0
Entering edit mode

No, I haven't checked for duplicates. I went through the CD-HIT pipeline and saw cd-hit-dup. Can you please explain how this is going to help?

ADD REPLY
0
Entering edit mode

Incidentally, my query is: with the existing assembly file (2,976,459 contigs), will STAR still require more RAM than the system's capacity even if I change the parameters? Will I be able to index using HISAT with less RAM? My ultimate goal is to run BRAKER with protein and RNA-seq data combined.

ADD REPLY
1
Entering edit mode

Doing QC on a newly built assembly is a must. If you have a lot of redundancy, you will have a mess on your hands downstream if you try to take it through as is.

The simplest advantage of removing redundancy would be a decrease in the size of the assembly and, with it, the number of contigs. That should help with the memory requirements for indexing.

ADD REPLY
0
Entering edit mode

Okay, GenoMax, thanks for pointing this out. I will use the tool you suggested (CD-HIT) to remove duplicates, if any, and then rerun the indexing. Meanwhile, just out of curiosity, what is the source of this redundancy: a poor assembly process, poor DNA quality, or a failure to resolve heterozygosity?

ADD REPLY
2
Entering edit mode
6 weeks ago

If you have 3 million contigs, it is not going to be possible to use STAR or any other aligner effectively; the assembly is just too fragmented. I once had a plant genome with 5 million contigs and it was utterly useless.

You could try using a related reference and a tool like RagTag to _attempt_ to build some sort of useful scaffolds/assembly.
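
A minimal sketch of that approach, purely for illustration (a suitable reference does not exist yet, per the discussion, and the file names are placeholders):

# order and orient the draft contigs against a related reference genome
ragtag.py scaffold related_reference.fa contigs.fa -o ragtag_out -t 8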

But seriously, why do people try to create assemblies with short reads these days? Sure, it might be cheap, but what do you actually learn? Long reads are the way to go, as you can see in just about any paper in the current literature.

ADD COMMENT
1
Entering edit mode

why do people try to create assemblies with short reads these days?

In my opinion, this is simply because most do not know any better (or are simply too overwhelmed to incorporate new sequencing techniques into their workflows), and sequencing as a technology has only recently become accessible to groups that are not at the cutting edge of research and, more importantly, are strongly constrained by funding (e.g., many groups working in ecology and so forth).

ADD REPLY
1
Entering edit mode

Yeah, I realize the PI generally just sequences something and then the PhD student or postdoc is left to sort it out. I don't want to be too critical, but at some stage honest feedback is needed. I think my first PacBio assembly was in 2014, so long reads have been very effective for quite some time.

ADD REPLY
1
Entering edit mode

Yeah, I realize the PI generally just sequences something and then the PhD student or postdoc is left to sort it out.

The situation is somewhat more complicated than that. In many cases, the PIs are completely oblivious to the utterly insane workloads they end up thrusting upon the technicians, PhDs, PostDocs, and other souls in their groups.

For instance, there seem to be many labs that expect PhDs and PostDocs without the requisite bioinformatics background to simply acquire this knowledge on the fly while also handling lab work (and potentially student supervision and teaching), all on a very tight financial and temporal budget. This leads to a situation where people simply elect to go with whatever is well-documented, well-published, and essentially "easy" (alongside being as inexpensive as possible) in their respective fields; I think it is often the case that entire sets of workflow scripts that someone had the misfortune of setting up years ago are handed down to newcomers in these groups. Nobody in these groups really has the time, energy, or resources to implement new workflows (for incorporating long-read sequencing, for example).

As a result, in ecology and anything that isn't cutting edge biology, in my experience, sequencing remains mostly short read sequencing.

but at some stage honest feedback is needed.

The feedback is presumably welcome but I don't think it is necessarily helpful unless we are ready to volunteer our time to interested groups to help them implement new workflows and techniques.

ADD REPLY
0
Entering edit mode

colindaven, I completed the STAR indexing by specifying a few parameters: I set --genomeSAindexNbases to 14, --genomeChrBinNbits to 9.6, --genomeSuffixLengthMax to 300, and --genomeSAsparseD to 2. With these settings, I was able to index the genome. Regarding scaffolding using RagTag, I don't have a reference genome of a closely related species. Is it worthwhile to use a reference genome from a closely related genus?
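
For readers who hit the same wall: values like these follow the scaling formulas in the STAR manual and can be computed from the assembly size and contig count quoted earlier in this thread. Treat the snippet below as a sketch only; the manual's --genomeChrBinNbits formula also considers the read length via max(GenomeLength/NumberOfReferences, ReadLength), and the ratio here already exceeds typical short-read lengths.

# assembly size and contig count as reported above
GENOME_LEN=2412294157
N_SEQS=2976459
# --genomeChrBinNbits = min(18, log2(GenomeLength / NumberOfReferences))  -> ~9 for this assembly
awk -v g="$GENOME_LEN" -v n="$N_SEQS" 'BEGIN{v=log(g/n)/log(2); if (v > 18) v = 18; print int(v)}'
# --genomeSAindexNbases = min(14, log2(GenomeLength)/2 - 1)  -> 14 for this assembly
awk -v g="$GENOME_LEN" 'BEGIN{v=log(g)/log(2)/2 - 1; if (v > 14) v = 14; print int(v)}'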

ADD REPLY
2
Entering edit mode

Well done, Vijith, but how far is it going to get you? You have 3 million contigs and an estimated 30,000 protein-coding genes. That is going to be tough to align RNA-seq reads to, since almost all genes are likely to be fragmented. You can try using a related reference genome, but another genus is likely a long, long way away. Do you have a transcriptome instead, or can you build one from RNA-seq with Trinity etc. and then pseudoalign your RNA-seq data to it with a tool like Salmon?
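
For reference, a bare-bones sketch of that Trinity-plus-Salmon route (file names are hypothetical; newer Salmon versions use selective alignment rather than pure pseudoalignment, but the workflow is the same):

# index the assembled transcriptome, then quantify the RNA-seq reads against it
salmon index -t trinity_transcripts.fa -i salmon_idx -k 31
salmon quant -i salmon_idx -l A -1 rnaseq_R1.fq.gz -2 rnaseq_R2.fq.gz -p 8 -o salmon_quant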

ADD REPLY
1
Entering edit mode

I agree with colindaven's suggestion to (sequence and de novo) assemble a transcriptome at this stage, especially if the objective is simply to construct a catalog of genes. That genome of yours, as it stands, is too fragmented to be of much use.

ADD REPLY
0
Entering edit mode

Dunois, as I responded to colindaven, we haven't done RNA-seq on the same tissue as the one used for WGS, but another group earlier performed RNA-seq on four different tissues, so there are four sets of RNA-seq data. While I was able to get their transcriptome data, its quality was questionable (per the BUSCO stats). So far, my idea was to align all of that RNA-seq data to my genome, assemble the aligned reads using StringTie, and later use them as evidence, along with protein evidence, for braker.pl. I don't think I have fully grasped the pseudoalignment process suggested by colindaven, but I am very willing to explore that possibility if it can help with the annotation despite my fragmented assembly.

ADD REPLY
0
Entering edit mode

You mentioned that the RNA-Seq data was contaminated. If the contaminants are predominantly of non-plant origin, you could theoretically classify the reads using Kraken2 or something similar to filter out the contaminant reads and reconstruct a "cleaned" transcriptome.
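
As a sketch of that idea (the database path and read file names are placeholders), one could classify the RNA-seq reads and keep only the pairs that do not match known non-plant taxa:

# unclassified read pairs (the '#' expands to _1/_2) are retained as the putative plant fraction
kraken2 --db /path/to/kraken2_db --threads 8 --paired \
        --unclassified-out clean_reads#.fq --report kraken_report.txt \
        rnaseq_R1.fq rnaseq_R2.fq > kraken_assignments.txt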

The pseudoalignment step that colindaven mentioned is, I believe, in the context of assembling and quantifying gene expression with a transcriptome. I doubt that will help much to improve the genome you have, unless I've misunderstood something here.

If I may ask, what is the actual purpose of assembling this genome of yours?

ADD REPLY
0
Entering edit mode

Dunois, what I have is an assembled genome of a monocot plant. The assembly was performed using CLC Genomics Workbench. Prior to assembly, I had estimated the genome size using KmerGenie, which predicted 3.3 Gb, but the actual assembly is 2.6 Gb. My next objective is to functionally annotate this genome for coding sequences. I am expecting ~20k CDS.

ADD REPLY
1
Entering edit mode

Sure, but why? What's the end goal of producing this genome? Is it simply having the genome published or is there some other set of downstream analyses planned?

ADD REPLY
0
Entering edit mode

Dunois, first and foremost, we want to publish this genome. This monocot plant is deeply embedded in our community's culture and is a major livelihood for the tribespeople. As there was no reference genome, we planned to assemble it de novo. So far, we have annotated the transposable elements and identified all the non-coding RNAs. Now, I would like to functionally annotate this genome. Later, we would like to identify all the sugar-metabolism-related, drought-resistance, and disease-resistance genes. This is a dioecious plant, so we are also interested in identifying the sex-specific genes.

ADD REPLY
0
Entering edit mode

Dunois, I would like to share with you the assembly statistics generated by the CLC Genomics Workbench: https://drive.google.com/file/d/1MoDREyouaEO9Wj6TNXOhUoqhcy_zeD7v/view?usp=sharing. Can you go through them? I would also like to give you the Illumina demultiplexing stats: of the total sequencing output, there were 606 million sample read-pairs and 304 million undetermined read-pairs. Of the total sampled reads, 82.09% were one-mismatch reads.

ADD REPLY
3
Entering edit mode

Hi Vijith, this contig set is not publishable; the contigs are all completely fragmented. I assemble plant genomes all the time with long reads and get 50-1,000 contigs, but never 3 million. Just take a gene predictor like AUGUSTUS, predict genes, and visualize the results in IGV. You'll see what I mean.

Please work towards getting long reads for this: look at the Nanopore diversity programs or big biodiversity projects, get yourself a MinION starter pack, etc.

I would expect publishable plant genomes to have at least a 1-2 Mb N50, better 30-70 Mb (and yes, we get these stats all the time).

ADD REPLY
1
Entering edit mode

colindaven, thank you for your helpful feedback. However, I'm feeling a bit helpless. You mentioned that the minimum acceptable N50 is 1-2 Mb, but my assembly has an N50 of ~1,300 kb (including scaffolds); is this not acceptable? As someone with extensive expertise, could you confirm whether the fragmentation of the contigs might be due to the ~304 million undetermined read-pairs that weren't included in the assembly (as mentioned in my previous comment), or could it be due to poor DNA quality or shearing? I've already performed de novo gene prediction using AUGUSTUS, which identified ~50k genes. Unfortunately, long-read sequencing is not feasible at this point due to costs and limited resources (neither my PI nor our institution has genomics expertise or a dedicated bioinformatics group). I'd appreciate your suggestions on two points:

  1. Would using an assembler like ABySS (I've already determined the optimal k-mer size) or Platanus help generate longer contigs if the fragmentation occurred because of low-coverage regions?

  2. Two years ago, we sequenced DNA from the same tissue using the TELL-Seq kit (on a NovaSeq 6000 sequencer). Although erroneous barcodes led to significant read loss (I recovered some reads, but the assembly was poor, resulting in a 25 Mb file with an N50 of ~1,300 kb), can I use that assembly to scaffold my current assembly and improve its accuracy?

ADD REPLY
1
Entering edit mode

Sorry, I appreciate your concerns and yes, data is expensive, but in my experience you'll never get a good assembly with short reads; they are just too short. Please look at the literature, but I doubt you'll see anything decent in the plant world published from short-read assemblies. Plants are just way too repetitive, and repeats break assemblies. By the way, I think your assembly has a 1,300 bp, not a 1,300 kbp, N50 (I looked at the CLC stats and they were confusing; again, please run the BBMap stats.sh analysis I posted before to get a definitive, independent answer).

ADD REPLY
0
Entering edit mode

colindaven, here are the stats.sh results. Looking forward to your valuable input: https://learnwithscholar.notion.site/BBTools-stats-sh-13bfbc19544c8071b8f4c201edde8c7a

ADD REPLY
2
Entering edit mode

Thank you for your detailed responses and the data.

I must unfortunately concur with colindaven here. This data set, as it stands, is not publishable, simply because you really can't do anything with an assembly this fragmented.

So far, we have annotated the transposable elements and identified all non-coding RNAs.

With what data? This genome?

Would using an assembler like ABySS

Wait, now I am curious. What assembler did you use? How, if at all, were the reads (pre-)processed prior to assembly?

I recovered some reads, but the assembly was poor, resulting in a 25 Mb file with N50 ~1300 Kb

The N50 of this older data set is better than that of your current genome? I doubt it, but it might be worth trying to assemble a genome with all the data available to you (this old data set as well as your current data).

Or could it be due to poor DNA quality or shearing?

Did you measure DNA integrity and quality prior to sequencing?

You also mentioned BUSCO scores. Against which BUSCO (or more precisely OrthoDB) data set did you measure these scores?

Unfortunately, long-read sequencing is not feasible at this point due to costs and limited resources

Do you potentially still have sufficient funding to do a round of RNA-Seq? You could do de novo transcriptome assembly and get a representative gene catalog.

Unless you've gone very wrong somewhere in pre-processing the reads and/or assembly, I feel that the chances of "rescuing" this genome in its current form are exceedingly low.

ADD REPLY
0
Entering edit mode

Dunois, as colindaven had instructed, here are the stats.sh results. I would kindly request that you, too, take a look at this information and offer your input: https://learnwithscholar.notion.site/BBTools-stats-sh-13bfbc19544c8071b8f4c201edde8c7a

ADD REPLY
2
Entering edit mode

the sample read-pairs were 606 million, and 304 million were undetermined. Of the total sampled reads, 82.09% were one-mismatch reads.

While one-mismatch reads are fine, it is curious that you had one-third "undetermined" reads, which must have had more than one error in the index or simply junk indexes. Poor DNA quality may be a reason for this.

Why was this sample indexed if it was a single sample? Did it run as part of a pool, or was it on a lane (or more) by itself?

ADD REPLY
1
Entering edit mode

Dunois and GenoMax

With what data? This genome?

The non-coding RNAs were detected using the Infernal pipelines on the current assembly data.

Wait, now I am curious. What assembler did you use? How, if at all, were the reads (pre-)processed prior to assembly?

The reads were adapter-trimmed using Trimmomatic (of the 606,076,779 total input read-pairs, 566,832,403 survived as pairs, 38,579,532 survived forward-only, and none were dropped), and the surviving read-pairs were assembled using the CLC Genomics Workbench (all parameters kept at their defaults). Before switching to CLC Genomics, the assembly was attempted with ABySS, which is based on k-mer approximation and therefore required less memory, but somehow the program terminated.

The N50 of this older data set is better than that of your current genome? I doubt it, but it might be worth trying to assemble a genome with all the data available to you (this old data set as well as your current data).

I didn't mean to say that the N50 of the previous assembly was better than that of my current assembly; the N50 of the earlier assembly is far too short (the value I mentioned in my previous comment was wrong). Here, I would ask you to take a look at the earlier assembly stats from TELL-LINK: https://drive.google.com/file/d/1wrUNDsiwIFE31PqZlpquHbjIqSrIZzEE/view?usp=sharing

Did you measure DNA integrity and quality before sequencing?

DNA quality was checked on a gel, and it was interpreted to be good (unfortunately, I was not involved in the wet lab part).

Actually, it was the aweMAG pipeline (BUSCO included) that I ran on the transcriptome data published by another group. I wanted to use this transcriptome data as BRAKER evidence, to improve my prediction. It was run in auto-euk mode, and it downloaded the eggNOG, Swiss-Prot, and BUSCO fungal databases.

Do you potentially still have sufficient funding to do a round of RNA-Seq? You could do de novo transcriptome assembly and get a representative gene catalog.

Dunois, there is a serious lack of funding, and there is no way transcriptome sequencing can be performed. My PI will be retiring in two years, and I doubt he would be interested in RNA-seq.

Unless you've gone very wrong somewhere in pre-processing the reads and/or assembly, I feel that the chances of "rescuing" this genome in its current form are exceedingly low.

My queries are:

  1. Can I somehow make use of the contigs output by the previous assembly (please go through the stats I shared with you) to improve the quality of my present assembly?

  2. Is there any possibility of merging the old TELL-Seq reads (the library preparation was different and used transposase-mediated molecular barcodes, so I guess the reads have unique molecular barcodes added to them; I have the list of all their unique barcodes, so these can be removed) with my current Illumina reads and performing a fresh assembly?

  3. Can I make use of the RNA-seq/transcriptome data generated by another group (for the same plant)? After BUSCO, I have separate lists of unique high-quality transcripts, duplicated transcripts, and fragmented transcripts, but there is serious contamination of ~50%.

ADD REPLY
2
Entering edit mode

Thank you for sharing all this information.

I have a few thoughts:

stats.sh from BBtools seems to indicate that you have:

Number of scaffolds > 50 KB:            5
% main genome in scaffolds > 50 KB:     0.02%

Since you said that BUSCO indicated contamination, maybe try extracting these 5 long contigs from the assembly and check the BUSCO scores against OrthoDB's viridiplantae data set for these contigs alone. I highly doubt it myself but there is a chance that the shorter sequences are all mostly contaminants and these long contigs represent (at the very least some) non-repetitive portions of the plant genome. You could also use some other length threshold (instead of 50KB) and see if you are basically able to "manually" filter for contigs associated with the plant genome. While this will filter out contaminants, this is unlikely to address the issue of fragmentation, unless the inclusion of the TELL-Seq data helps bridge some gaps.
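
One way to do that extraction and check (the tool choice here is mine, not the poster's, and file names are placeholders):

# keep scaffolds of at least 50 kb, then score them against the viridiplantae lineage
seqkit seq -m 50000 assembly.fa > contigs_ge50kb.fa
busco -i contigs_ge50kb.fa -l viridiplantae_odb10 -m genome -c 8 -o busco_ge50kb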

The non-coding RNAs were detected using the Infernal pipelines on the current assembly data.

I would be very cautious in trusting this data then.

  1. Can I somehow make use of the contigs output by the previous assembly (please go through the stats I shared with you), to improve the quality of my present assembly? Any possibility of merging the old TELL-reads (the library preparation was different and used Transposase mediated molecular barcodes, so I guess the reads have unique molecular barcodes added to them. But I have the list of all their unique barcodes with me, therefore can be removed) with my current Illumina reads and perform a fresh assembly.

The only suggestion I can think of is to try and reassemble the genome from scratch using all the data (including the TELL-Seq reads), following the workflow (or something similar to it) outlined below; a rough command-level sketch follows the list:

  1. Trim adapters and discard read pairs with even a single N base (i.e., ambiguous base call) using fastp.
  2. Take the cleaned reads and scan them against the PlusPF dataset (https://benlangmead.github.io/aws-indexes/k2) using Kraken2 to remove read pairs originating from non-plant sources.
  3. Assemble all the data together (simply concatenate the respective forward reads files into a single forward read file and ditto for the reverse reads) using SPAdes or perhaps clover (https://doi.org/10.1186/s12859-020-03788-9).
  4. Check assembly quality using BUSCO w/ OrthoDB viridiplantae and other metrics.
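
A rough command-level sketch of steps 1-4 above (every path, database location, and file name is a placeholder, and the parameters are illustrative rather than tuned):

# 1. adapter-trim and drop read pairs containing any N base
fastp -i illumina_R1.fq.gz -I illumina_R2.fq.gz -o illumina_R1.trim.fq -O illumina_R2.trim.fq --n_base_limit 0 --thread 8
# (repeat for the TELL-Seq read files)

# 2. screen trimmed reads against the Kraken2 PlusPF database; keep unclassified (putatively plant) pairs
kraken2 --db /path/to/k2_pluspf --threads 8 --paired \
        --unclassified-out illumina_plant#.fq illumina_R1.trim.fq illumina_R2.trim.fq > kraken_illumina.txt

# 3. concatenate the cleaned Illumina and TELL-Seq reads and assemble
cat illumina_plant_1.fq tellseq_plant_1.fq > all_R1.fq
cat illumina_plant_2.fq tellseq_plant_2.fq > all_R2.fq
spades.py -1 all_R1.fq -2 all_R2.fq -o spades_assembly -t 8 -m 240

# 4. assess the new assembly
busco -i spades_assembly/scaffolds.fasta -l viridiplantae_odb10 -m genome -c 8 -o busco_new_assembly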

If you are very lucky, you might find yourself with an improved assembly. I'd say it's worth a shot.

Can I make use of the RNA seq/ transcriptome data generated by another group (of the same plant)? After BUSCO, I have a separate list of unique high-quality transcripts, duplicated transcripts, and fragmented transcripts. But there is a serious contamination of ~50%

If you have a clean genome, this data might be useful for annotation, but outside of that, I don't really think so. The best you could do is a reciprocal best hit (RBH) search between the genes from your genome and the "genes" from this transcriptome, and then try to filter out the contaminants as those sequences that do not have a hit from the genome's side of the RBH search.
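
For what it's worth, a minimal BLAST-based sketch of such an RBH filter (sequence file names are hypothetical; any aligner that reports best hits would work as well):

# hypothetical inputs: genome_cds.fa = CDS predicted from the genome, transcripts.fa = the other group's transcriptome
makeblastdb -in genome_cds.fa -dbtype nucl
makeblastdb -in transcripts.fa -dbtype nucl

# best genome hit per transcript, and best transcript hit per genome CDS (highest bitscore)
blastn -query transcripts.fa -db genome_cds.fa -outfmt "6 qseqid sseqid bitscore" -num_threads 8 \
  | awk '$3 > s[$1] {s[$1] = $3; b[$1] = $2} END {for (q in b) print q "\t" b[q]}' > t_vs_g.tsv
blastn -query genome_cds.fa -db transcripts.fa -outfmt "6 qseqid sseqid bitscore" -num_threads 8 \
  | awk '$3 > s[$1] {s[$1] = $3; b[$1] = $2} END {for (q in b) print q "\t" b[q]}' > g_vs_t.tsv

# keep only pairs that are each other's best hit; transcripts absent from this list are candidate contaminants
awk 'NR==FNR {best[$1] = $2; next} best[$2] == $1 {print $1 "\t" $2}' g_vs_t.tsv t_vs_g.tsv > rbh_pairs.tsv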

My PI will be retiring in two years, and I doubt he would be interested in an RNA seq.

RNA-Seq, if you do it right, is very fast. De novo transcriptome assembly is nowhere near as tedious as de novo genome assembly. Theoretically, with some help, you can have the assembly and annotation ready in as little as a week after sequencing.


All this said, if what you basically need is a published paper from this data without any additional sequencing, I think the most you could do is reassemble the genome from scratch using all the data you have, as I have suggested, supplement its annotations using the RNA-Seq data you mentioned, do some mining on the annotations (e.g., look for drought-resistance genes or whatever), and submit the entire thing to some relatively low-impact-factor journal. Technically, you could even do an analysis of the contaminants (I am presuming these are actually just plant-associated bacteria and the like) alongside, and while the genomic data itself might be more or less useless as it stands, the work done would represent useful science nevertheless.

I do also have one question: this RNA-Seq data from this other group, has a transcriptome based on it been published already? If not, you could assemble that data into a de novo transcriptome and publish that instead.

ADD REPLY
0
Entering edit mode

Dunois, I apologize for this belated response. As suggested, I would like to share the BUSCO results for contigs at varying length thresholds, i.e., >50 kb, >10 kb, >5 kb, and >1 kb. Additionally, I have the BUSCO results for the entire assembly. Actually, I don't have the means to revisit any experimental procedures, i.e., to improve sequencing depth or employ long-read sequencing. Here are the BUSCO results: https://learnwithscholar.notion.site/BUSCO-149fbc19544c802f9710ff7330be4eaf

ADD REPLY
0
Entering edit mode

Thank you for this information.

Those BUSCO scores look really funky. For instance, the 50 kb (and longer) contigs have no genes in them (according to BUSCO, anyway). So what are these sequences? Just repeats? It's worth looking into that.

I unfortunately do not have any other recommendations at this juncture other than to try and reassemble the genome following the suggestions I had put forth in my previous response to you.

And I would urge you to consider the notion that any published sequencing data is better than no data at all in the case of species that have not been well-sequenced. So a publication putting out this data to the public and presenting an analysis of this data (e.g., the long, apparently gene-free contigs) is still valuable.

ADD REPLY
0
Entering edit mode

GenoMax, it was not a single sample; multiple samples had been pooled in the sequencing run. I don't know for sure whether this can be attributed to poor DNA quality. Let me quote Universal Sequencing Technology (UST), who mentioned the following in an email sent to me:

This report was generated in August 2022. One of the things you can see here is that the input DNA molecule length is 16KB, which is very short and results in average of 2-3 reads per molecule. We aim for 12-15 reads for each molecule, which makes the most of the linked reads.

The same DNA was used for the current sequencing as was used for the linked-read generation; they had kept the tissue cryopreserved for a year.

ADD REPLY
1
Entering edit mode

colindaven, thanks for the insights. We never did RNA-seq ourselves, but data from a different group are publicly available. I checked the completeness of their transcriptome (assembled using Trinity), which is ~87% complete with ~50% contamination. blastx annotated ~8,000 CDS from among the total predicted, and now I want to improve this using BRAKER's evidence-based prediction. Can you please elaborate a bit on the pseudoalignment process and its effectiveness? NB: my PI is not a genomics person, this is my first whole-genome project, and I'm learning most concepts on the fly.

ADD REPLY
