Hello everyone, hope you are doing well
The SAM/BAM file format specification says there is a limit on chromosome size that prevents indexing of a .bam file if the reference genome has exceptionally large chromosomes. The limit for the standard .bai index is 2^29-1, which is about 537 Mbp. This is quite a lot: all human chromosomes, for example, are smaller than this, so the limitation does not get in the way very often. However, some organisms, such as barley and wheat, have chromosomes around 600 Mbp long or more, so with these genomes it can be an obstruction. I've just tried it:
samtools index file_sort.bam
and it gave the following error:
[E::hts_idx_check_range] Region 536962398..536962445 cannot be stored in a bai index. Try using a csi index with min_shift = 14, n_lvls >= 6
[E::sam_index] Read 'NB552414:80:H3WTKBGXK:3:21607:22796:5232' with ref_name='2H', ref_length=665585731, flags=16, pos=536962399 cannot be indexed
samtools index: failed to create index for "file_sort.bam": Numerical result out of range
Has anyone here encountered this before? If so, how can it be handled? For example, one could split the chromosomes into chunks of some 300 Mbp to prevent the error from happening. Or am I being too paranoid and seeing issues where there are none? Thanks in advance for any help, Nick Shmakov, jr researcher, ICG SB RAS
If you absolutely need a bai index, you'll need to break your chromosomes into smaller contigs. If you can use a csi index instead, it supports sizes greater than the bai limit, given appropriate min_shift and n_lvls parameters (samtools index -c file_sort.bam writes a .csi index instead of a .bai one).
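For reference, the arithmetic behind these limits can be sketched in a few lines of Python. This assumes the binning scheme covers 2^(min_shift + 3 * n_lvls) positions, with bai corresponding to min_shift = 14 and n_lvls = 5. The function names are made up for this illustration and are not part of samtools or htslib.

```python
# Illustrative sketch only: these helper names are hypothetical, not an htslib API.
# Assumption: a binning index with the given parameters spans
# 2^(min_shift + 3 * n_lvls) positions; bai is the fixed case 14 / 5.

def max_indexable_length(min_shift: int, n_lvls: int) -> int:
    """Number of positions the index can cover with these parameters."""
    return 1 << (min_shift + 3 * n_lvls)


def min_levels_needed(ref_length: int, min_shift: int = 14) -> int:
    """Smallest n_lvls whose span is at least ref_length."""
    n_lvls = 0
    while max_indexable_length(min_shift, n_lvls) < ref_length:
        n_lvls += 1
    return n_lvls


BAI_SPAN = max_indexable_length(14, 5)   # 2^29
REF_LENGTH_2H = 665585731                # ref_length reported in the error above

print("bai span:", BAI_SPAN)
print("2H fits in bai:", REF_LENGTH_2H <= BAI_SPAN)
print("n_lvls needed for 2H:", min_levels_needed(REF_LENGTH_2H))
```

With min_shift = 14, chromosome 2H needs n_lvls = 6, which agrees with the hint printed in the samtools error message.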
Thank you for your suggestion. Unfortunately, not everything works with the .csi file format, but apparently you don't need indexes at all for SNP calling with bcftools.
I suggest splitting your chromosomes into smaller chunks. Maybe you don't need an index for variant calling, but many programs used in subsequent analyses cannot process long chromosomes. How long is your longest chromosome? In my experience, many programs cannot handle chromosomes longer than 2^31-1 bp (the largest signed 32-bit integer). And the worst thing is that many of them don't even show an error or warning message.
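If you do go the splitting route, the bookkeeping might look like the sketch below. It only computes chunk coordinates and converts a chunk-relative position back to the original chromosome; it does not touch any FASTA or BAM file. The 300 Mbp chunk size comes from the original question, while the "_partN" naming and both function names are assumptions made up for this example.

```python
# Illustrative sketch only: naming scheme and helpers are hypothetical.
# Coordinates are 0-based, half-open intervals [start, end).

def split_into_chunks(name: str, length: int, chunk_size: int):
    """Return a list of (chunk_name, start, end) covering the chromosome."""
    chunks = []
    for i, start in enumerate(range(0, length, chunk_size), start=1):
        end = min(start + chunk_size, length)
        chunks.append((f"{name}_part{i}", start, end))
    return chunks


def to_original_position(chunk, pos_in_chunk: int) -> int:
    """Convert a 0-based position inside a chunk back to the chromosome."""
    _, start, end = chunk
    if not 0 <= pos_in_chunk < end - start:
        raise ValueError("position lies outside the chunk")
    return start + pos_in_chunk


chunks = split_into_chunks("2H", 665585731, 300_000_000)
for chunk in chunks:
    print(chunk)

# A position 1,000 bp into the second chunk, mapped back to 2H.
print(to_original_position(chunks[1], 1_000))
```

Every chunk stays well under the bai limit, but any coordinates reported by downstream tools have to be converted back before they are compared with annotations on the original assembly.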