Hi,
I am new here. I am trying to build an index from the GENCODE lncRNA annotation. I did the following.
extract splice site information
hisat2_extract_splice_sites.py gencode.v28.long_noncoding_RNAs.gtf >gnc.ss
extract exon information
hisat2_extract_exons.py gencode.v28.long_noncoding_RNAs.gtf >gnc.exon
run build
hisat2-build -p11 --ss gnc.ss --exon gnc.exon geno GRCh38.p12.genome.fa genlnc
The computer I am using is a Windows workstation with 12 cores (I am using 11 cores, but it hardly uses 10% CPU at most). It shows 45 GB of installed RAM, of which almost 40 GB is being used by hisat2-build. I started the process on Monday, and even though the computer has been running continuously, it hasn't built the index. The hisat2 paper suggested that building a whole-genome index with 160 GB of RAM should take 2-3 hours, so I am confused about why it hasn't finished after 5 days when I have roughly 1/4 of the recommended RAM.
Before this, I tried using the primary assembly file to make the index, and it didn't finish in two weeks. So I thought maybe the primary assembly file was too big and switched to p12. When I run ls -lh, I see that the biggest file is the .rtf file, which I read is a temporary file. Right now it is 42 GB. I am using Cygwin to run Linux commands on Windows. Am I missing something? Please advise.
Also, on the side, could you tell me the difference between using the primary assembly and p12 or a newer assembly for making the index file?
I don't have the computational explanation you're looking for, unfortunately, but I don't think the time to completion scales down linearly in the way you're expecting. I tried building an index with 32GB of RAM and it failed - I think the index build needed to load more data than that into memory. I eventually used a cluster and assigned ~200GB to the operation, and it ran smoothly. If you have access to cloud or cluster resources, I recommend you go that route.
Hi Russ. Thanks for the reply. I do have access to a Linux cloud/cluster. But I couldn't install hisat2 over there. Any tips for that?
You'll have to talk to the sys admin of your cluster if you don't have the privileges to install hisat2.
Have you tried installation using (bio)conda?
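For reference, a minimal sketch of what a bioconda installation might look like, assuming conda (e.g. Miniconda) is already installed in your home directory. The function name install_hisat2_conda and the environment name hisat2 are arbitrary choices for this example, not anything required by conda or hisat2.

```shell
# Create a dedicated conda environment containing hisat2 from the bioconda channel.
# The environment name "hisat2" is an arbitrary choice for this sketch.
install_hisat2_conda() {
    # Bail out early if conda is not available on PATH
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH; install Miniconda first" >&2
        return 1
    fi
    # -y skips the confirmation prompt; conda-forge is listed before bioconda
    conda create -y -n hisat2 -c conda-forge -c bioconda hisat2
}
```

Calling install_hisat2_conda and then running conda activate hisat2 should put hisat2-build on the path without needing admin privileges.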
Hi Wouter. I was able to download hisat2 and add it to the path on the Linux server. Now I am running into the problem of libstdc++.so.6 not being up to date. I asked the server manager and he said the system is old and updating is a pain. Could you tell me if there are any Linux servers I can access to do this, and whether they are free?
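To see how old the server's libstdc++ actually is, one option is to list the GLIBCXX version tags it contains and compare them with the version named in the error message. This is only a sketch: the helper name glibcxx_versions is made up for this example, the default path /usr/lib64/libstdc++.so.6 is an assumption that varies between distributions, and sort -V assumes GNU coreutils.

```shell
# List the GLIBCXX symbol versions that a given libstdc++ provides.
# The default path is an assumption; pass the real location as the first argument.
glibcxx_versions() {
    lib="${1:-/usr/lib64/libstdc++.so.6}"
    # -a treats the binary as text, -o prints only the matching version tags;
    # sort -u -V removes duplicates and orders them by version number
    grep -ao 'GLIBCXX_[0-9][0-9.]*' "$lib" | sort -u -V
}
```

Running glibcxx_versions with the path to the library prints one tag per line, with the newest supported version last. If the version in the error message is higher than that, the prebuilt binary cannot run against this library.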
So you have tried installation using bioconda?
No, I just downloaded and unpacked the Linux binary for hisat2 and added the directory to my path.
Why don't you try installation using bioconda?
I couldn't install Miniconda because of the same libstdc++ problem. :(
Right, well, that sucks.