I want to parse and index genomes in EMBL format, compressed with gz/xz, which I downloaded from EBI. The problem is that I can only list features by working with the uncompressed files, and the uncompressed files are too big.
As far as I understand, if I recompress them using bgzip from tabix, I can use them in Biopython [1]. Can I directly index the xz files compressed with LZMA2, which gives a much smaller file? It should be possible in principle [2]. I am wondering if anyone has done it.
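To make the "possible in principle" part concrete, here is a minimal sketch using only Python's standard `lzma` module. It is not Biopython's API and not the BGZF format itself; it only illustrates the shared idea that data compressed as independent blocks can be accessed randomly if you keep a table of block offsets. The function names `compress_in_blocks` and `read_range`, the block size, and the toy EMBL-like records are all made up for illustration.

```python
import io
import lzma


def compress_in_blocks(data: bytes, block_size: int):
    """Compress data as a series of independent xz streams.

    Returns (blob, index), where index has one tuple per block:
    (uncompressed_start, compressed_start, compressed_length).
    """
    out = io.BytesIO()
    index = []
    for start in range(0, len(data), block_size):
        chunk = lzma.compress(data[start:start + block_size])
        index.append((start, out.tell(), len(chunk)))
        out.write(chunk)
    return out.getvalue(), index


def read_range(blob: bytes, index, offset: int, length: int, block_size: int) -> bytes:
    """Return data[offset:offset + length], decompressing only overlapping blocks."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    parts = []
    for _ustart, cstart, clen in index[first:last + 1]:
        parts.append(lzma.decompress(blob[cstart:cstart + clen]))
    joined = b"".join(parts)
    skip = offset - first * block_size  # position of `offset` inside the first block
    return joined[skip:skip + length]


# Toy stand-in for an EMBL flat file: 1000 fixed-size records (hypothetical content).
records = b"".join(b"ID   REC%04d\nSQ   acgt\n//\n" % i for i in range(1000))
record_len = len(records) // 1000
block_size = 4096  # arbitrary choice for the sketch

blob, index = compress_in_blocks(records, block_size)

# Fetch record 500 without decompressing the whole blob.
rec_500 = read_range(blob, index, 500 * record_len, record_len, block_size)
print(rec_500.decode())
```

The trade-off is the one you would expect: smaller blocks give faster random access but a worse compression ratio, because each block is compressed without knowledge of the others. For real files the index would be stored on disk alongside the compressed data rather than kept in memory.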
How much data are we talking about, zipped and unzipped? It may be wiser to get more disk space. Random access to zipped data is complex, and I am a bit sceptical. Have you considered the UCSC 2bit format and indexing?
Thank you for the info. It's around 110 GB with gz compression. Indexing would allow fast access to various parts of the file.