7.7 years ago
retrogenomics
Hi,
I have a folder which contains all my genome reference sequences (in FASTA format, e.g. hg19.fa, hg38.fa, mm10.fa, etc.) and I would like to store the index files generated by each short read mapper (e.g. bwa, bowtie, ...) in a different folder. My problem is the following:
Is it possible with bwa aln/samse to specify the location of the index? A call to bwa would look like:
bwa aln <ref_genome.fasta> <reads.fastq> | bwa samse <ref_genome.fasta> - <reads.fastq> > mapped_reads.sam
Thanks
To avoid having multiple copies of the fasta file.
A symbolic link back to the original fasta file from each of the index directories would do the trick as suggested.
We have shared reference sequences on our server, but not everyone uses the same short read aligner, so the indexes can be stored in private directories. In addition, each short read aligner generates its own index files, and it is somewhat difficult to keep track of what is what. Yes, the symbolic link in the index directories is a good idea. Thanks.
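A minimal sketch of the symbolic link approach, in case it helps. The directory names (`references`, `indexes/bwa`) and the `hg19.fa` file name are only placeholders for illustration. It relies on bwa writing its index files next to the path it is given, so indexing through the link keeps them in the index directory.

```shell
# Hypothetical paths: adjust to your own layout.
REF_DIR="${REF_DIR:-$PWD/references}"     # shared FASTA files
IDX_DIR="${IDX_DIR:-$PWD/indexes/bwa}"    # private bwa index directory
GENOME="${GENOME:-hg19.fa}"               # placeholder genome name

mkdir -p "$IDX_DIR"

# Link the shared FASTA into the index directory instead of copying it.
ln -sf "$REF_DIR/$GENOME" "$IDX_DIR/$GENOME"

# By default bwa uses the input path as the index prefix, so the index
# files end up beside the symlink rather than beside the original FASTA.
# Only run if bwa is installed and the FASTA actually exists.
if command -v bwa >/dev/null 2>&1 && [ -s "$REF_DIR/$GENOME" ]; then
    bwa index "$IDX_DIR/$GENOME"
fi
```

Afterwards the aligner would be pointed at `$IDX_DIR/$GENOME` rather than at the original FASTA.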
That does not make sense. You are avoiding having multiple copies of the reference but potentially allowing multiple private copies of the index files (which are larger). Keeping all of these in a common location (and managed centrally) is extremely useful.
I like the organization that iGenomes comes with. Under a "Sequence" directory store the sequence as well as separate directories for all aligner indexes that people use at your facility.
I agree. Still, I'm wondering how much an index depends on the version of the aligner used to make it. Do you think it could be a problem?
Independently of this, regarding shared references/indexes: if several users of a common server start mapping reads against the same reference genome at the same time, will mapping be slowed down or otherwise become problematic?
Aligners rarely change their indexing scheme (the only examples I can think of are when bwa went from 0.6.x to 0.7.x, and STAR may have done so once). A big change like this would generally be well publicized, so you would have time to react accordingly.
If the users are hitting the same storage system, then having indexes in one location as opposed to several may not have a significant effect (in terms of time or I/O problems). On a shared server/cluster you likely have a high-performance shared storage solution.
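One way to keep track of this on a shared server is to record the aligner version next to each index. The sketch below is only an illustration: the `.aligner_version` file and both helper functions are made up for this example, and comparing only the major.minor series is an assumption based on the 0.6.x to 0.7.x change mentioned above.

```shell
# Hypothetical helper: store the aligner version beside an index prefix.
record_index_version() {
    printf '%s\n' "$2" > "$1.aligner_version"
}

# Succeed only if the recorded version and the given version share the
# same major.minor series (e.g. 0.7.12 and 0.7.17 both give "0.7").
same_index_series() {
    recorded=$(cut -d. -f1,2 "$1.aligner_version")
    current=$(printf '%s\n' "$2" | cut -d. -f1,2)
    [ "$recorded" = "$current" ]
}

# bwa prints a "Version: ..." line in its usage message on stderr.
if command -v bwa >/dev/null 2>&1; then
    BWA_VERSION=$(bwa 2>&1 | awk '/^Version:/ {print $2}' || true)
fi
```

Calling `record_index_version <prefix> "$BWA_VERSION"` right after indexing, and `same_index_series <prefix> "$BWA_VERSION"` before mapping, would flag an index built with a different series.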
How do I call bwa with the index on a cluster when using the iGenomes layout, which keeps the FASTA genome and the index files in separate folders? I'm using the following command line
bwa mem -t 20 Sequence/WholeGenomeFasta/genome.fa sample_1_cleaned.fastq sample_2_cleaned.fastq
but I get the same error
[E::bwa_idx_load_from_disk] fail to locate the index files
You don't specify the FASTA file; you use the base name (prefix) of the genome index instead. So it should be something like
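A rough sketch of that idea, under some assumptions: the index prefix `Sequence/BWAIndex/genome.fa` is a guess at an iGenomes-style layout, not something confirmed in this thread, and `index_is_complete` is a hypothetical helper. It checks for the five files bwa 0.7.x writes (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`), builds the index under that prefix with `bwa index -p` if it is missing, and then passes the prefix (not the FASTA path) to `bwa mem`.

```shell
# Assumed iGenomes-style paths; adjust to the real layout.
FASTA="${FASTA:-Sequence/WholeGenomeFasta/genome.fa}"
INDEX_PREFIX="${INDEX_PREFIX:-Sequence/BWAIndex/genome.fa}"

# Hypothetical helper: succeed only if all five bwa 0.7.x index files
# exist and are non-empty for the given prefix.
index_is_complete() {
    for ext in amb ann bwt pac sa; do
        [ -s "$1.$ext" ] || return 1
    done
    return 0
}

if command -v bwa >/dev/null 2>&1; then
    # Build the index under the chosen prefix if it is missing.
    # -p sets the prefix of the output index files.
    if ! index_is_complete "$INDEX_PREFIX" && [ -s "$FASTA" ]; then
        mkdir -p "$(dirname "$INDEX_PREFIX")"
        bwa index -p "$INDEX_PREFIX" "$FASTA"
    fi
    # Map against the index prefix, not the FASTA file.
    if index_is_complete "$INDEX_PREFIX"; then
        bwa mem -t 20 "$INDEX_PREFIX" \
            sample_1_cleaned.fastq sample_2_cleaned.fastq > mapped_reads.sam
    fi
fi
```

The same prefix works as the first argument to `bwa aln` and `bwa samse`, which also covers the original question.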