Hi there,
I am currently trying to index a huge genome (8.3 Gbp) and provided the exons and splice sites, as recommended in the HISAT2 manual. As you can imagine, running this has been taking up a lot of memory, but after a long time the code is still running, and it says it is at its 7th generation. My question is: how many generations does the index builder normally go through? Are we almost there, or is it time to abort the attempt to build the index?
Would it be faster/more convenient to try to build the index without providing the exon and splice site data, and how relevant would that index still be for downstream transcriptomics analysis?
Thanks for any clarification, Nienke
Index with the --ss and --exon options on large genomes (e.g. human, mouse, zebrafish etc.) only if you have more than 200 GB of RAM. If not, simply build the index like this:
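A minimal sketch of a plain build, without --ss or --exon. The file names, the index base name and the thread count are placeholders, not values from this thread. The command is printed first and only run if hisat2-build and the FASTA file are actually present.

```shell
# Placeholder names (assumptions): replace with your own files.
GENOME_FA="genome.fa"
INDEX_BASE="genome_index"
THREADS=8

# Plain index build: no --ss / --exon, so no splice graph is built
# and memory use stays much lower.
set -- hisat2-build -p "$THREADS" "$GENOME_FA" "$INDEX_BASE"
CMD_LINE="$*"
echo "$CMD_LINE"

# Run only when the tool and the input are available.
if command -v hisat2-build >/dev/null 2>&1 && [ -f "$GENOME_FA" ]; then
    "$@"
fi
```

For a genome of this size hisat2-build should switch to the large index format on its own, so the output files end in .ht2l rather than .ht2.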
You can provide the exon information at the time of alignment like this:
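A sketch of how this can look, assuming paired-end reads and placeholder file names. As far as I know, what the aligner accepts at this stage is a list of known splice sites (--known-splicesite-infile), which can be extracted from the GTF with the hisat2_extract_splice_sites.py script shipped with HISAT2. I am not aware of a matching alignment-time option for the exon file.

```shell
# Placeholder names (assumptions): replace with your own files.
GTF="annotation.gtf"
SPLICE_SITES="splicesites.txt"
INDEX_BASE="genome_index"
READS_1="reads_1.fastq.gz"
READS_2="reads_2.fastq.gz"
OUT_SAM="aligned.sam"

# Step 1: extract known splice sites from the annotation (only if possible here).
if command -v hisat2_extract_splice_sites.py >/dev/null 2>&1 && [ -f "$GTF" ]; then
    hisat2_extract_splice_sites.py "$GTF" > "$SPLICE_SITES"
fi

# Step 2: align against the plain index, passing the splice sites at run time.
set -- hisat2 -p 8 -x "$INDEX_BASE" \
    --known-splicesite-infile "$SPLICE_SITES" \
    -1 "$READS_1" -2 "$READS_2" -S "$OUT_SAM"
CMD_LINE="$*"
echo "$CMD_LINE"

# Run only when the tool and the splice site file are available.
if command -v hisat2 >/dev/null 2>&1 && [ -f "$SPLICE_SITES" ]; then
    "$@"
fi
```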
Best,
I am not an expert, but I did get an opinion from a person who provides core services at the NIH: you do not need to have the transcriptome GTF for building the index. Do it without the exon information and it will be fine. You will be using the annotation file while quantifying.
I am doing exactly that and my analysis doesn't look bad. However, I did realize later on that including exon information in the index is crucial if your focus is on splicing isoforms. For a normal DEG analysis I wouldn't worry.
You can also use a pre-built index from the HISAT2 webpage, but then you need the GTF file for that same version in the quantification step. For example, I was using a pre-built index based on the version 86 GTF, and it gave errors when I used the GENCODE v92 GTF.