Hi All,
I am trying to generate genome indexes using STAR (v2.5.0a). To do so I use this command:
'STAR --runThreadN 24 --runMode genomeGenerate --genomeDir /path/genomeDir --genomeFastaFiles /path/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /path/Homo_sapiens.GRCh38.86.gtf --sjdbOverhang 74'
Both fa and gtf files are from ENSEMBL.
The generation seems to work (no error is displayed, either on the command line or in the log file), but when I look at the generated files I do not have any genome files as I should, only: chrLength.txt, chrNameLength.txt, chrName.txt, chrStart.txt and genomeParameters.txt.
In the Log.out file, for the genome files generation step, I have the following, which I found a bit odd because of the high chr numbers:
Nov 17 09:20:27 ... starting to generate Genome files
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 0 "1" chrStart: 0
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 1 "10" chrStart: 249036800
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 2 "11" chrStart: 382992384
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 3 "12" chrStart: 518258688
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 4 "13" chrStart: 651689984
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 5 "14" chrStart: 766246912
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 6 "15" chrStart: 873463808
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 7 "16" chrStart: 975699968
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 8 "17" chrStart: 1066139648
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 9 "18" chrStart: 1149501440
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 10 "19" chrStart: 1229979648
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 11 "2" chrStart: 1288699904
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 12 "20" chrStart: 1530920960
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 13 "21" chrStart: 1595408384
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 14 "22" chrStart: 1642332160
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 15 "3" chrStart: 1693188096
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 16 "4" chrStart: 1891631104
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 17 "5" chrStart: 2081947648
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 18 "6" chrStart: 2263613440
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 19 "7" chrStart: 2434531328
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 20 "8" chrStart: 2593914880
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 21 "9" chrStart: 2739142656
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 22 "MT" chrStart: 2877554688
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 23 "X" chrStart: 2877816832
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 24 "Y" chrStart: 3034054656
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 25 "KI270728.1" chrStart: 3091464192
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 26 "KI270727.1" chrStart: 3093561344
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 27 "KI270442.1" chrStart: 3094085632
...
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 191 "KI270423.1" chrStart: 3137601536
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 192 "KI270392.1" chrStart: 3137863680
/path/Homo_sapiens.GRCh38.dna.primary_assembly.fa : chr # 193 "KI270394.1" chrStart: 3138125824
Number of SA indices: 5891698134
Nov 17 09:21:39 ... starting to sort Suffix Array. This may take a long time...
I tried changing both the FASTA and GTF files and also the STAR version, without success. I cannot figure out how to make this work properly. Any ideas?
Thank you, L.
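A quick way to confirm the symptom described in the question is to list which index files are absent. This is only a sketch: the function name `missing_index_files` is made up for illustration, and the file list assumes the usual STAR layout, in which Genome, SA and SAindex are the large binary files and the .txt files are small metadata.

```shell
# Sketch: print the core STAR index files that are absent from a genome directory.
# The function name is hypothetical; the file list is an assumption about the usual layout.
missing_index_files() {
    dir=$1
    for f in Genome SA SAindex chrName.txt chrLength.txt chrStart.txt chrNameLength.txt genomeParameters.txt; do
        # Print the file name only when it does not exist in the directory
        [ -e "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Usage: missing_index_files /path/genomeDir
```

For the directory described above, this would be expected to report Genome, SA and SAindex, which suggests the run stopped before the suffix array was written.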
STAR needs about 30 GB of RAM for the human genome. Did you make sure that sufficient RAM was available?
How long did you wait? With GRCh37 (Ensembl FASTA and GTF), 8 threads and 40 GB of RAM, it took me ~50 min to generate the index.
There is 192 GB of available RAM on the Linux server I am using. It took an hour to complete the generation step.
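Installed RAM and RAM that is free at run time can differ on a shared server, so it may be worth checking right before launching STAR. A minimal sketch, assuming a Linux kernel recent enough to expose MemAvailable in /proc/meminfo. The function name and the optional file argument are illustrative only.

```shell
# Sketch: print MemAvailable in whole GB from a meminfo-style file (Linux only).
# The optional argument exists only so the function can be tried on a saved copy.
available_ram_gb() {
    # /proc/meminfo reports the value in kB, so divide twice by 1024
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage: available_ram_gb
```

If the number printed is well below the roughly 30 GB mentioned above, other jobs on the machine may be competing for memory.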
Do you need the alt contigs/haplotypes? Otherwise you could take those out of the reference and generate the index.
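If the extra contigs are not needed, one possible way to drop them is to filter the FASTA by sequence name. This is a sketch, not a tested recipe for this dataset: it assumes Ensembl-style headers where the first word after `>` is the name (1 to 22, X, Y, MT), and the function name `keep_main_chromosomes` is invented for the example.

```shell
# Sketch: keep only FASTA records whose name is a number, X, Y or MT.
# Assumes Ensembl-style headers such as ">1 dna:chromosome ..."; reads stdin, writes stdout.
keep_main_chromosomes() {
    awk '/^>/ { name = substr($1, 2); keep = (name ~ /^([0-9]+|X|Y|MT)$/) } keep'
}

# Usage: keep_main_chromosomes < Homo_sapiens.GRCh38.dna.primary_assembly.fa > main_chromosomes.fa
```

The GTF may still contain annotations on the removed sequences. I am not sure how every STAR version handles those, so filtering the GTF in the same way may be the safer choice.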
Alex has some ready-made genome indexes available here (GRCh38 is included).
I haven't worked with the GRCh38 version of the genome. But by 'high car numbers' did you mean chr? Those are alt contigs, and in the case of GRCh37_primary_assembly there are more than 80. Here is a snippet of the log file, in case it helps:
This may be an obvious question, but do you have enough space available on disk (and/or in /tmp)? I am not sure whether STAR uses /tmp to temporarily hold files/data.
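Free space can be checked per directory, which helps when /tmp and the genome directory sit on different filesystems. A small sketch using POSIX `df`. The function name is hypothetical, and the parsing assumes the filesystem name contains no spaces.

```shell
# Sketch: print the free space, in whole GB, of the filesystem holding a directory.
# Assumes POSIX "df -Pk" output, where column 4 of line 2 is the available space in kB.
free_disk_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Usage: free_disk_gb /path/genomeDir; free_disk_gb /tmp
```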
Currently I have approximately 600 GB of free space. I guess that should be more than enough for the generation step.
My suspicion is that the sorting process (last line of the log file) is getting killed by the kernel itself. The kernel can kill any erratic, resource-hungry process without the process having any chance to catch and report the signal. You may try these (in your order of preference):
1) Could you re-run while restricting the memory usage with these parameters: --genomeSAindexNbases 12 (or even 10) --genomeSAsparseD 3 (see the manual)? If that doesn't work, also try lowering --limitGenomeGenerateRAM (25 GB?):
--limitGenomeGenerateRAM 25000000000
2) Even if your machine has huge RAM and disk space, you might be bound by personal quotas. Could you paste the output of the
ulimit -a
and
quota
commands here?
3) If you have permissions, check the kernel log (grep STAR /var/log/kern.log) and the system log (grep STAR /var/log/syslog) for any unusual words like killed or aborted.
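The log check in the last suggestion can be widened a little, since out-of-memory messages do not always sit on a line that is easy to spot by grepping for STAR alone. A hedged sketch: the function name is made up, the patterns are typical Linux OOM-killer phrases rather than an exhaustive list, and the log paths differ between distributions.

```shell
# Sketch: print log lines that look like the kernel's out-of-memory killer at work.
# Takes one or more log files as arguments, or reads stdin when none are given.
find_oom_kills() {
    grep -Ei 'out of memory|oom-killer|killed process' "$@"
}

# Usage: find_oom_kills /var/log/kern.log /var/log/syslog
#    or: dmesg | find_oom_kills
```

Any hit that names STAR around the time the log stops would support the idea that the sort was killed for memory.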
Also, just to be sure, is that your complete log file?
Thanks for the advice. I will try all your suggestions and see how it goes.
For ulimit -a I obtain:
and for quota I get 'none'.
It is just an extract of the log file.
Everything looks normal here. Please also post the last 20-30 lines of the log file.
Open files and stack size may be things to follow up on (increase both) in case you are not able to find anything else.
You may want to check with your sys admins to see if they are able to look in kernel/system logs for any other clues as suggested by @Santosh before.
Stack limit is not the culprit. It has the same (default) value on my box, where I can easily do indexing. On second thought, open files could be the cause if the sort or another process is creating too many tmp files. The kernel log / syslog might be the way to go. You may also post this on the STAR mailing list. Alex is usually very responsive. And please post the answer when you get it. It's a curious case!
I tried your 1), without success so far. Apparently there has been a mix-up in the installed versions and I may be using 2.4.1c rather than 2.5.0a. I will try to change it and see how it goes.
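When several STAR builds are installed, it can help to ask the binary itself which one is being picked up. A minimal sketch, assuming the installed build accepts `--version` (the exact output format may differ between releases). The function name `star_version` is invented here.

```shell
# Sketch: print the version string reported by a STAR binary.
# With no argument, the STAR found first on PATH is queried.
star_version() {
    "${1:-STAR}" --version 2>/dev/null | head -n 1
}

# Usage: command -v STAR     # shows which binary the shell resolves
#        star_version        # version of that binary
#        star_version /path/to/other/STAR
```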
What do the last lines of the logs say in the two cases? The same?
Same thing yes. With the 2.5.0a version it worked. Thanks for the help.
Thank you. Good to know that it worked finally.