My Answer:
I think you should at least try. It's not an issue of reputation / cost, but a matter of statistics.
Clue:
- use awk / sed + sqlalchemy / polars / connectorx / pyarrow to extract the XA tag, along with the line number of the corresponding record, and replace all XA tag values with * in the original cram files (see the first sketch after this list).
- Parse the XA tags and save them in a database (SQLite / MariaDB / any other open-source database you like). For every alternative hit in the XA tag of every record from all sample.cram files, the main table in the database should have at least these columns:
sample_id,line_number,chr,pos,CIGAR,NM
and of course you need additional tables to hash the chr, pos, CIGAR and NM.
- write a python script / shell script / C / Rust binary to retrieve the info for a given line in a sample.cram file, with multi-threading / multi-processing capability (see the second sketch below).
- share the database and the executable along with your cram files. Tell your colleagues that if the XA tag is needed in downstream analysis, they should run your executable to retrieve the XA tags first.
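A minimal sketch of the extract-and-strip step, working through SAM text. It assumes samtools is on PATH; the database file name, output path, and re-encoding command are placeholders you would adapt:

```python
#!/usr/bin/env python3
"""Extract XA tags from one CRAM into SQLite and write a stripped SAM.
Sketch only: samtools is assumed to be on PATH, and the database /
output file names are placeholders."""
import sqlite3
import subprocess
import sys

sample_id, cram = sys.argv[1], sys.argv[2]

con = sqlite3.connect("xa_tags.sqlite")
con.execute("""CREATE TABLE IF NOT EXISTS xa
               (sample_id TEXT, line_number INTEGER,
                chr TEXT, pos TEXT, cigar TEXT, nm INTEGER)""")

# Stream the alignments as SAM text; the header passes through untouched.
proc = subprocess.Popen(["samtools", "view", "-h", cram],
                        stdout=subprocess.PIPE, text=True)

rows, line_number = [], 0
with open(f"{sample_id}.stripped.sam", "w") as out:
    for line in proc.stdout:
        if line.startswith("@"):
            out.write(line)
            continue
        line_number += 1
        fields = line.rstrip("\n").split("\t")
        for i, field in enumerate(fields[11:], start=11):   # optional tags
            if field.startswith("XA:Z:"):
                # XA:Z:(chr,pos,CIGAR,NM;)* -> one row per alternative hit
                for hit in field[5:].rstrip(";").split(";"):
                    chrom, pos, cigar, nm = hit.split(",")
                    rows.append((sample_id, line_number, chrom, pos, cigar, int(nm)))
                fields[i] = "XA:Z:*"                         # keep the record valid
        out.write("\t".join(fields) + "\n")
proc.wait()

con.executemany("INSERT INTO xa VALUES (?,?,?,?,?,?)", rows)
con.commit()
con.close()
# Re-encode afterwards, e.g.:
#   samtools view -C -T ref.fa -o sample.stripped.cram sample.stripped.sam
```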
It's quite easy yet somewhat labor-intensive.
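For the retrieval step, a minimal sketch that queries the same (assumed) SQLite table by sample_id and line_number and rebuilds the XA string, with a thread pool for batch queries; the sample IDs in the example are hypothetical:

```python
#!/usr/bin/env python3
"""Retrieval side: rebuild the original XA string for a given sample_id and
line_number from the (assumed) xa_tags.sqlite table created above."""
import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB = "xa_tags.sqlite"

def rebuild_xa(sample_id: str, line_number: int) -> str:
    con = sqlite3.connect(DB)          # one connection per worker thread
    cur = con.execute(
        "SELECT chr, pos, cigar, nm FROM xa "
        "WHERE sample_id = ? AND line_number = ?",
        (sample_id, line_number))
    hits = [f"{c},{p},{cg},{nm}" for c, p, cg, nm in cur]
    con.close()
    return ("XA:Z:" + ";".join(hits) + ";") if hits else ""

if __name__ == "__main__":
    queries = [("sampleA", 1), ("sampleA", 2), ("sampleB", 1)]   # hypothetical
    with ThreadPoolExecutor(max_workers=4) as pool:
        for tag in pool.map(lambda q: rebuild_xa(*q), queries):
            print(tag)
```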
Your Todo:
Try parsing all XA tags from several cram files into fields and pooling them into a dataframe:
chr,pos,CIGAR,NM
Note that there may be multiple items in a single XA tag, resulting in multiple rows (see the sketch below).
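A minimal sketch of that expansion with polars; the two tag values are fabricated, and a real run would feed the values pulled from your sample.cram files:

```python
"""Split XA tag values into (chr, pos, CIGAR, NM) rows with polars.
The example values below are made up; a real run would feed the values
pulled from your sample.cram files."""
import polars as pl

xa_values = [   # hypothetical XA:Z: values, "XA:Z:" prefix already stripped
    "chr1,+10500,100M,2;chr5,-22010,60M40S,1;",
    "chr2,-33000,100M,0;",
]

rows = []
for value in xa_values:
    # one XA tag may contain several (chr,pos,CIGAR,NM;) items -> several rows
    for hit in value.rstrip(";").split(";"):
        chrom, pos, cigar, nm = hit.split(",")
        rows.append((chrom, pos, cigar, int(nm)))

df = pl.DataFrame(rows, schema=["chr", "pos", "CIGAR", "NM"], orient="row")
print(df)
```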
Draw a scatterplot with a fitted curve of reads_count vs. {x}, as well as another scatterplot of sample_count vs. {x}, where x could be one of the following: unique chr-pos-CIGAR-NM combinations, unique chr, unique pos, unique CIGAR, unique NM.
I think you should try at least 10 to 50 sample.cram files, starting from 1 and scaling up. This number really depends on the sample heterogeneity. You can use simple random sampling / stratified random sampling / selecting the samples with the greatest heterogeneity (e.g. 10 samples with 10 different types of cancer) to estimate how bad it could get.
If a converging trend is observed in the scatterplots and the database size is acceptable, you should definitely do it. Moreover, you could simply save the temporary dataframe as parquet.zst using pyarrow / polars and watch the trend of the file size, as sketched below.
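A rough sketch of that convergence check: add samples one at a time, count the unique chr-pos-CIGAR-NM combinations seen so far, and watch the size of the pooled table written as zstd-compressed parquet. samtools on PATH and the sample paths are assumptions:

```python
#!/usr/bin/env python3
"""Convergence check: pool XA hits sample by sample and track the number of
unique combinations plus the size of the zstd parquet on disk."""
import os
import subprocess
import polars as pl

def parse_sample(cram: str) -> pl.DataFrame:
    """Return one row per alternative hit found in the XA tags of `cram`."""
    rows = []
    proc = subprocess.Popen(["samtools", "view", cram],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        for field in line.rstrip("\n").split("\t")[11:]:
            if field.startswith("XA:Z:"):
                for hit in field[5:].rstrip(";").split(";"):
                    chrom, pos, cigar, nm = hit.split(",")
                    rows.append((chrom, pos, cigar, int(nm)))
    proc.wait()
    return pl.DataFrame(rows, schema=["chr", "pos", "CIGAR", "NM"], orient="row")

pooled = None
for n, cram in enumerate(["s01.cram", "s02.cram", "s03.cram"], start=1):  # hypothetical
    df = parse_sample(cram)
    pooled = df if pooled is None else pl.concat([pooled, df])
    uniq = pooled.unique(subset=["chr", "pos", "CIGAR", "NM"]).height
    pooled.write_parquet("pooled.parquet.zst", compression="zstd")
    size_mb = os.path.getsize("pooled.parquet.zst") / 1e6
    print(f"{n} samples: {pooled.height} hits, {uniq} unique combos, {size_mb:.1f} MB")
# If `uniq` and the file size flatten out while n keeps growing, the database
# approach pays off; those are the quantities to plot in the scatterplots above.
```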
Reasons:
- The XA tag is used by many variant callers, especially some somatic variant callers (e.g. LANCET), and by methylation counters (irrelevant here).
- The size of a compressed file is ultimately determined by its complexity / entropy. The XA tag is structured and has a bounded format. The old BWA manual describes it as:
XA Alternative hits; format: (chr,pos,CIGAR,NM;)*
which means it can be compressed well, and it is easy to infer that the benefit (compression ratio) is positively related to sample count, and negatively related to sample heterogeneity (race, gender, disease/health status...) and to the length of the sequenced genomic regions.
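A toy illustration of the entropy argument (not a benchmark on real data): the same amount of XA-like text compresses much better when the alternative hits are drawn from a small shared pool (a homogeneous cohort) than when every hit is unique (a heterogeneous cohort). All hit strings below are fabricated:

```python
"""Toy demo: redundancy, not size, drives the compression ratio.
All strings below are fabricated; real ratios depend on your cohort."""
import random
import zlib

random.seed(0)

def hit() -> str:
    # a plausible-looking (chr,pos,CIGAR,NM;) item
    return (f"chr{random.randint(1, 22)},{random.choice('+-')}"
            f"{random.randint(1, 10**6)},100M,{random.randint(0, 3)};")

n = 50_000
shared = [hit() for _ in range(500)]   # small pool of recurring alternative hits
homogeneous = "".join(random.choice(shared) for _ in range(n)).encode()
heterogeneous = "".join(hit() for _ in range(n)).encode()

for name, data in (("homogeneous", homogeneous), ("heterogeneous", heterogeneous)):
    ratio = len(data) / len(zlib.compress(data, 9))
    print(f"{name:>13}: {ratio:.1f}x")
```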
Curious as to how you calculated this?
If you are that worried (and have some cash available) then https://www.genozip.com/ has been praised by some users on Biostars.
When buying a proprietary tool there is a risk that others may not be able to access the data. I don't know if a free decompressor is available for others to use.
Author posted here that they will make it available free for academic use (in case you qualify) --> Launched: Genozip 15 with co-compression of BAM and FASTQ
Original post has the associated publication: Genozip: A new compression tool for FASTQ, BAM, VCF and more
I used samtools cram-size to calculate this.
I already played around with genozip, and it is great for cold storage; however, the biggest issue is that you have to convert genozip back into a bam/cram before using it. Decompressing is always free, so that is not a financial issue.
The idea behind uploading to the cloud is that other researchers can analyze the data; having to decompress the data before use makes it harder to work with, while the whole point of the cloud solution is to make it easier.
Keeping this amount of data online in the cloud would cost 16K a year, so there is a bit of money at stake.
True, but if you expect a number of people will use the data, then that is the cost of doing science / being a good researcher, and of shielding people from having to go through extra hoops to work with the data.
Any money you may save upfront by removing the tags may be negated by your time/effort if you end up having to reupload the data (which seems like a good chunk). So it may be best to leave things as they are.
Could you at least give two example records from the CRAM file?
I am unable to paste it here (it complains about the language), so I made a snippet at https://pastebin.com/3sKa7i9c
I see, it's extremely long and nested; no wonder the file size explodes.