Hi all,
I work with wheat which has annoyingly large chromosomes (most >600 Mbp), and I've stumbled across this limitation of the BAI indexing format (which indexes .bam files) in the samtools index
manual:
"The BAI index format can handle individual chromosomes up to 512 Mbp (2^29 bases) in length. If your input file might contain reads mapped to positions greater than that, you will need to use a CSI index."
I don't suppose anyone knows if this limitation has been quietly fixed over the years? If not I'm a bit stuck as it seems that the .bam/.bai combination of files is required for some key pieces of software I'd like to use, e.g. IGV and ATACseqQC (Bioconductor). Unless the CSI index format is functionally interchangeable with BAI for such software?
Cheers!
As of
v.2.4.x
IGV supports.csi
format index files.but as it it based on htsjdk, IGV doesn't support the long chromosomes (length> INT32_MAX)
Hi Pierre, thanks for your response. Could you elaborate on what (length> INT32_MAX) means? Does it translate to 2^32 bases?
yes , that is what I meant. Htsjdk use int32 for the number of bases in the reference
Thank you, that's very helpful! And is that per reference sequence? e.g. if I had, say, 20 chromosomes in my reference genome each <512 Mb (2^29), then the total comes out to more than 2^32. Would this still work?
no, the 2^32 is the max for ONE chromosome, not for the size of the genome (2^64)
Thanks for confirming that, cheers!
Thanks for your response! Good to know that
.csi
works in IGV. Unfortunately I'll need to use other downstream software too. I'm guessing from the fact that it was only since v.2.4.x that IGV supported it that it is by no means 'default' for a programme utilising.bai
to also be able to use.csi
?not a lot of tools even use the BAM indexes, only tools that need to sub-select a BAM file for a specific coordinate range for example, so it tends to be a genome browser type use case. you could also consider converting to CRAM and if the tool supports CRAM and not CSI, CRAM index should work with it (CRAM uses a CRAI index, not CSI and allows large genomes). You also mention FAI, FAI is a text based format, not a binary one, so wouldn't generally have these limitations on maximum int size, and also FAI is for FASTA indexing (e.g. FAI lets you get a slice of the FASTA at some given coordinates without reading whole file) not BAM (BAI/CSI allow getting a slice of BAM without reading the whole file)
Hi cmdcolin, thanks for your response!
RE FAI, that was typo on my part, apologies, I did mean BAI. I have edited to reflect this.
Interesting to hear that the BAM indexes aren't that widely utilised... that's good news for sure! However, even if there isn't a requirement for an index, might it be the case that a piece of software can't handle reference sequences >2^29 bases?
e.g. I have since tried to use an R/Bioconductor package called ATACseqQC and it threw up an error relating to reference genome size:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'X' in selecting a method for function 'sapply': 'end' must be <= 536870912
file: A_50K_CER_sorted_unimappers.bam
index: A_50K_CER_sorted_unimappers.bam
The number quoted in the error (536870912) is equal to 2^29.
You can break your largest chromosomes into two (or more pieces, if needed) pieces so the individual pieces stay under the limit. Then you should be able to use all tools (in theory) with this reference. You will need to figure out how to manage/keep track of the overlaps (if you choose to break chromosomes that way).
Indeed, that's the conclusion I've come to as well. Bit of a pain, as will need to re-coordinate annotation files for the split chromosomes too. Thanks for your insight!