How to speed up Salmon indexing

Hi all,

I am trying to build a salmon index from a gene catalog covering 4 million genes (the file is 1.2 GB) on an HPC cluster, using 3000 GB of memory. Unfortunately, the cluster enforces a walltime limit and my job was cancelled after 2 days, so indexing apparently needs more than 2 days under these conditions.

Is there any way to reduce the indexing time?

This is my basic salmon index command:

salmon index -t gene_catalog.fna -i gene_catalog.fna.index

Any advice or alternatives would be great!

Thank you


Are you running this on a cluster? Does the cluster have an NFS-mounted file system?


Yes, this is a Unix-based HPC cluster, and it does use an NFS-mounted file system.


I recently indexed 1.2 million transcripts with salmon and it took about 2 hours. The biggest difference from your command is that I used 16 cores, whereas you appear to be doing it with a single core. My command was:

salmon index -p 16 \
             --sparse \
             -t salmon_index/DAT/gentrome.fa.gz \
             -d salmon_index/DAT/decoys.txt \
             -i salmon_index/agg-agg-agg.filtered.salmon.index

That is interesting. I actually tried 32 cores for this, but the runtime did not change. I do not provide a decoys.txt, and what exactly is --sparse? Maybe those are the issue. What do you think?


So you did use -p 32 on your command line? You should also designate a tmp directory location on fast disks (instead of NFS-mounted storage); see the example command at the end of this reply.

How did you prepare your input FASTA? I suspect it contains sequence redundancy. If you want to keep the duplicates, then you will also need to use

  --keepDuplicates              This flag will disable the default indexing 
                                behavior of discarding sequence-identical 
                                duplicate transcripts.  If this flag is passed,
                                then duplicate transcripts that appear in the 
                                input will be retained and quantified 
                                separately.

--sparse will actually slow things down further:

--sparse                      Build the index using a sparse sampling of 
                                k-mer positions This will require less memory 
                                (especially during quantification), but will 
                                take longer to construct and can slow down 
                                mapping / alignment
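
Putting that together, a hedged sketch of the command (the index name and the --tmpdir path are placeholders; check that your salmon version supports --tmpdir):

    salmon index -t gene_catalog.fna \
                 -i gene_catalog_index \
                 -p 32 \
                 --keepDuplicates \
                 --tmpdir /local/scratch/salmon_tmp

Dropping --sparse avoids the extra construction cost, and pointing the temporary workspace at a local disk sidesteps the NFS small-file problem.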

Yes, I did use -p 32, and this is a non-redundant gene catalog based on a metagenome assembly. That's a good point; I could temporarily run it on internal NVMe.

Okay, what about this: if I use bwa for indexing and alignment and then run salmon quant on the alignment files, do you think I will get the same quantification output as when I use salmon for both indexing and mapping?


I would probably not use bwa, as it is very eager to soft-clip reads, which salmon doesn't really like. You could use Bowtie2 + salmon quant and that would work well (from a quality perspective). Of course, you will have much bigger intermediate BAM files, and things will be slower because of the alignment step.
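
A rough sketch of that Bowtie2 + salmon quant route (the file names, the -k multi-mapping cap, and the thread counts here are assumptions to adapt to your data):

    # build a Bowtie2 index of the gene catalog
    bowtie2-build gene_catalog.fna gene_catalog_bt2

    # align paired-end reads, keep multi-mappers, convert straight to BAM
    bowtie2 -p 16 -k 200 --no-unal -x gene_catalog_bt2 \
            -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
        | samtools view -b -o aln.bam -

    # alignment-based salmon quantification against the same FASTA
    salmon quant -t gene_catalog.fna -l A -a aln.bam -p 16 -o salmon_quant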

Regarding the long indexing runtime, it's almost certainly the result of writing lots of small files to the NFS. We've known of this issue for a while now, but it's very difficult to fix. First because we can't easily reproduce it on our cluster, but second because the underlying tool that is used for compacted de Bruijn graph construction (TwoPaCo) simply creates many small files as part of its normal process, and this is a pathological case for NFS — it's unclear how exactly to fix it.

However, if you can use fast local scratch for the index construction (e.g. an SSD, NVMe, or even a local HDD), that should make it go much faster. It shouldn't take anywhere near 2 days, but rather a few hours.
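
As a sketch of the local-scratch approach (the $TMPDIR variable and the paths are assumptions; use whatever node-local scratch your cluster provides):

    # copy the input to node-local scratch, build the index there,
    # then copy only the finished index back to shared (NFS) storage
    cp gene_catalog.fna "$TMPDIR"/
    cd "$TMPDIR"
    salmon index -t gene_catalog.fna -i gene_catalog_index -p 16
    cp -r gene_catalog_index /path/to/your/project/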


Hey Rob,

Thanks for your answer. What exactly is the difference between bwa index and salmon index? If it is a very long answer you do not need to explain it, but I just tried bwa index and it took about 30 minutes.

Now I am trying to build a salmon index on NVMe; let's see how long it takes.

but second because the underlying tool that is used for compacted de Bruijn graph construction (TwoPaCo) simply creates many small files as part of its normal process, and this is a pathological case for NFS — it's unclear how exactly to fix it.

Parallelization may not completely solve the issue, but at least it could help distribute the load more efficiently, right? I do not know if parallelization is already happening in that step.

Or, if it is applicable, if I split my reference file into several chunks and index each one, can I somehow concatenate these indexes later?

Many Thanks!


What exactly is the difference between bwa index and salmon index?

Many NGS-related programs (mainly aligners) use data structures to store information about the reference so it can be searched/looked up quickly. These result in a set of files which are not human-readable (binary). Every program uses its own implementation, so the "indexes" are program-specific and cannot be used interchangeably.

but if I split my reference file into several chunks and index each one, can I somehow concatenate these indexes later?

I don't think you can do that since all data is considered when creating an index.


bwa is not splice-aware, but since you are dealing with metagenomic data (assuming it is prokaryote-only) it may be usable. Otherwise, use a splice-aware aligner.


I was assuming they were planning to align directly to the transcriptome (the one they are trying to index). But yes, if aligning to the genome (and projecting to the txome), it's always recommended to use a spliced aligner like STAR. In fact, STAR is the only one I am aware of that can project alignments onto the txome.
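
For reference, a sketch of that genome-alignment-and-projection route (the STAR index directory, read files, and transcript FASTA names are placeholders):

    # --quantMode TranscriptomeSAM makes STAR also write alignments projected
    # onto the transcriptome (Aligned.toTranscriptome.out.bam)
    STAR --runThreadN 16 \
         --genomeDir star_index \
         --readFilesIn reads_R1.fastq.gz reads_R2.fastq.gz \
         --readFilesCommand zcat \
         --quantMode TranscriptomeSAM \
         --outSAMtype BAM Unsorted

    # quantify the transcriptome-projected alignments with salmon in alignment mode
    salmon quant -t transcripts.fa -l A \
                 -a Aligned.toTranscriptome.out.bam \
                 -p 16 -o salmon_quant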


The set is supposed to be 4M non-redundant "genes". Not sure if this catalog is for prokaryotes or may contain eukaryotic genes. OP will need to clarify whether this data came from RNA-seq or DNA-seq originally.


I'm sorry for the confusion. The gene catalog contains a mixture of prokaryotic and eukaryotic genes, and the catalog is genome-based. So it is not a transcriptomic study.


Do you expect there to be multi-mapping? One of the advantages of programs like salmon is that they can deal with this statistically. Otherwise you could simply align the data with an aligner and then count using featureCounts.
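
For completeness, a minimal sketch of the aligner + featureCounts route (the annotation and BAM file names are placeholders); note that featureCounts skips multi-mapping reads by default, which is exactly the situation salmon handles better:

    # count fragments per gene with featureCounts (Subread package);
    # -p counts read pairs, -T sets the number of threads
    featureCounts -T 16 -p \
                  -a gene_catalog_annotation.gtf \
                  -o gene_counts.txt \
                  sample1.bam sample2.bam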


Yes, that is exactly why I use salmon: it can handle the multi-mapping issue, as you said.


Perhaps you are looking for a tool like kraken2? That can be used to identify reads in metagenomics, but it is unclear what exactly you are aiming to achieve.


Well, I constructed a non-redundant gene catalog and now I want to calculate gene abundances; that's why I use salmon, and the reason I chose it is that it can handle multi-mapping.

I think Kraken2 is a totally different story in my case. It is still k-mer based and competitive, but Kraken2 was designed for taxonomic assignment or contamination detection, right?


In an offline chat, the author of salmon mentioned that reads which multi-map to a large number of your "genes" will end up being discarded and will not be counted. Something to keep in mind.


I see, that is good to know. Back to the main topic: even though I am indexing the gene catalog on NVMe, it has now been running for 17 hours...


Hopefully it will complete in the 2 days you have. Otherwise you will need to rethink this strategy.


Okay, I wanted to correct my mistake in case anyone has the same issue. I ran salmon index on an NVMe disk on the HPC with high memory (~3000 GB) and 32 cores...

Then I checked the batch log file again and realized the cluster had not used the cores I requested:

TBB Warning: The number of workers is currently limited to 0. The request for 31 workers is ignored. Further requests for more workers will be silently ignored until the limit changes..

I do not know why it did not use the cores in that partition, but that is another story. I then changed my partition, decreased the memory (~200 GB), increased the core count (64), and tried it on the NVMe disk again:

the ~4 million genes were indexed in ~10 minutes...

First of all, my big mistake was overlooking the warning in the log file... Secondly, the number of cores seems to matter more than memory and even NVMe, but that is just an impression; I do not have any detailed benchmarking.
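
In case it helps anyone else, here is a sketch of a batch script (assuming SLURM, since the thread mentions partitions; the partition name, memory, and walltime are placeholders) that ties -p to the allocation, so a silently reduced core count shows up immediately:

    #!/bin/bash
    #SBATCH --partition=some_partition
    #SBATCH --cpus-per-task=64
    #SBATCH --mem=200G
    #SBATCH --time=06:00:00

    # SLURM_CPUS_PER_TASK keeps -p in sync with what the scheduler actually granted
    salmon index -t gene_catalog.fna \
                 -i gene_catalog_index \
                 -p "${SLURM_CPUS_PER_TASK}"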

So @i.sudbery had already answered the question. Thank you all, and sorry again for taking up your time.


While we have an explanation, it is odd that one partition (I assume you mean a job scheduler queue and not an actual disk partition) allows the TBB library to run in parallel mode and another does not. Do you have two installs of salmon?


Yes, these are job scheduler queues, not different disk partitions. No, salmon is installed globally on the cluster and I load the specific version via a module, so I do not think there is a conflict issue.
