Which tools use or rely on the XA tag in SAM/BAM/CRAM files?
2
0
Entering edit mode
15 days ago
Maarten • 0

I'm working on a project that needs to upload 6000+ human 30X WGS CRAMs to a cloud instance for collaboration. We keep a copy of the original data on our infrastructure, and we want to reduce the file size to reduce costs.

Currently, the files contain the optional XA:Z tag, which takes up 22% of the total file size and looks to me like a candidate for removal. I understand that BWA generates this tag to represent alternative hits for a read. However, I do not know of other tools that use this tag.

My questions are:

  • Which tools (other than BWA and bwa.js) actively use, parse, or rely on the XA tag for downstream analysis?

  • Is this tag used by any variant callers, alignment post-processors, or read mappers?

  • Is it a good or bad idea to remove these tags?

cram XA compress • 884 views
ADD COMMENT
0
Entering edit mode

files contain the optional XA:Z tag, which takes up 22% of the total file size

Curious as to how you calculated this?

If you are that worried (and have some cash available) then https://www.genozip.com/ has been praised by some users on Biostars.

When buying a proprietary tool there is a risk that others may not be able to access the data. I don't know if a free decompressor is available for others to use.

Author posted here that they will make it available free for academic use (in case you qualify) --> Launched: Genozip 15 with co-compression of BAM and FASTQ

Original post has the associated publication: Genozip: A new compression tool for FASTQ, BAM, VCF and more

ADD REPLY
0
Entering edit mode

Curious as to how you calculated this?

I used samtools cram-size to calculate this.

I already played around with genozip, and it is great for cold storage; however, the biggest issue is that you have to convert the genozip file back into a BAM/CRAM before using it. Decompressing is always free, so that is not a financial issue.

The idea behind uploading to the cloud is that other researchers can analyze the data; having to decompress the data before use makes it harder to work with, while the whole point of the cloud solution is to make things easier.

When buying a proprietary tool there is a risk that others may not be able to access the data. I don't know if a free decompressor is available for others to use.

Keeping this amount of data online in the cloud would cost $16K a year, so there is a bit of money involved.

ADD REPLY
0
Entering edit mode

Keeping this amount of data online in the cloud would cost $16K a year, so there is a bit of money involved.

True, but if you expect a number of people to use the data, then that is the cost of doing science/being a good researcher, and of shielding people from having to jump through extra hoops to work with the data.

Any money you may save upfront by removing the tags may be negated, in terms of your time/effort, if you end up having to re-upload the data (which seems like a good chunk). So it may be best to leave things alone as they are.

ADD REPLY
0
Entering edit mode

Could you at least share a couple of example records from the CRAM file?

ADD REPLY
0
Entering edit mode

I am unable to paste it here (the forum complains about the language), so I made a snippet at https://pastebin.com/3sKa7i9c

ADD REPLY
0
Entering edit mode

I see, it's extremely long and nested, no wonder the file size explodes.

ADD REPLY
0
Entering edit mode
14 days ago

One workaround could be to remove these tags and realign these sequences if you ever need the XA tags again.

The mapping quality will be zero for these reads, so you could identify and remap them at a later time to produce the alternative alignments if needed.
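For the removal itself, here is a minimal sketch assuming pysam, with placeholder file and reference names (recent samtools versions can also strip tags directly with samtools view -x XA, if I recall correctly):

    import pysam

    REF = "GRCh38.fa"  # placeholder reference path

    # Copy the CRAM, dropping the XA tag from every read that carries it.
    with pysam.AlignmentFile("in.cram", "rc", reference_filename=REF) as fin, \
         pysam.AlignmentFile("out.cram", "wc", template=fin, reference_filename=REF) as fout:
        for read in fin:
            if read.has_tag("XA"):
                read.set_tag("XA", None)  # setting the value to None deletes the tag
            fout.write(read)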

I have very rarely needed the actual content of XA tags, and almost always in some sort of genomic rearrangement / structural variation type of study.

Carrying a $16K extra cost (which might even increase) to store data you don't need adds up quite a bit over time.

ADD COMMENT
0
Entering edit mode

remove these tags and realign these sequences

While we don't know how many other labs/users are going to use this data, with 6000+ samples others may not have the resources to redo what has already been done. Plus, the cloud storage may not be required indefinitely, so spending $16K for one year may actually be a good investment for this lab, especially if a significant number of other labs plan to use the data.

ADD REPLY
0
Entering edit mode

I think it is important to frame the tradeoff correctly.

It is not about whether the data is online or not; it is about whether the XA tags will ever be needed or not.

The rest, the essential data, would still be stored and shared accordingly. Is storing XA tags that nobody needs worth X amount per year?

ADD REPLY
0
Entering edit mode

It is not about whether the data is online or not; it is about whether the XA tags will ever be needed or not

OP said that the data has to be made available in the cloud. We (and OP) don't know at this time whether the consumers of this data will ever need the XA tag. Since the work has already been done (the XA tags exist), it is now a matter of deciding whether a bunch of additional work should be done to remove that existing information. If something gets messed up in the process, it would cause more headaches for many others later.

Like you said, at a minimum only the FASTQ data needs to be shared so everyone else can do what they want to. But it sounds like in this case the decision has been made to share the pre-aligned data.

ADD REPLY
0
Entering edit mode

To expand on this, you could perhaps store the multi-mapped alignments, or just the XA tags, separately in a much cheaper storage tier that takes longer to access. That way, if you ever need the information you can restore it, but it would not cost the same.
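As a rough sketch of that idea (pysam assumed, paths purely illustrative), the XA strings could be dumped into a compressed per-sample sidecar keyed by read name, cheap to archive and easy to join back on later:

    import gzip
    import pysam

    REF = "GRCh38.fa"  # placeholder reference path

    # One line per read carrying an XA tag: read name <tab> raw XA string.
    with pysam.AlignmentFile("sample.cram", "rc", reference_filename=REF) as cram, \
         gzip.open("sample.xa.tsv.gz", "wt") as sidecar:
        for read in cram:
            if read.has_tag("XA"):
                sidecar.write(f"{read.query_name}\t{read.get_tag('XA')}\n")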

ADD REPLY
0
Entering edit mode

The mapping quality will be zero for these reads, so you could identify and remap them at a later time to produce the alternative alignments if needed.

To make matters a bit more complicated, reads with XA tags do not necessarily have a MAPQ of zero.

I know that CRAM v4 might have a tokenizer for this kind of data and be a lot smaller; however, that is still years away (for CRAM 3.1 it took about 6 years to go from the final version to being the default, which it will be in the next samtools version, by the way).

I have very rarely needed the actual content of XA tags, and almost always in some sort of genomic rearrangement / structural variation type of study.

The reason to share the CRAMs is that others can do "funky" stuff. We already deliver a joint-called VCF, so the low-hanging fruit is already there.

Carrying a $16K extra cost (which might even increase) to store data you don't need adds up quite a bit over time.

The $16K is for the full dataset including XA; the XA part would be around $3.5K. I think for now it is worthwhile to keep them in.

ADD REPLY
0
Entering edit mode
14 days ago
JustinZhang ▴ 130

My Answer:

I think you should at least try. It's not an issue of reputation / cost, but a matter of statistics.

Clue:

  1. Use awk / sed + sqlalchemy / polars / connectorx / pyarrow to extract the XA tag along with the line number of the corresponding records, and replace all values of the XA tag with * in the original CRAM files.
  2. Parse the XA tags and save them in a database (SQLite / MariaDB / any other open-source database you like); a rough sketch follows this list. For every alternative hit in the XA tag of all records from all sample.cram files, there should be at least these columns in the main table of the database:
    sample_id,line_number,chr,pos,CIGAR,NM
    
    and of course you need additional tables to hash the chr, pos, CIGAR and NM.
  3. Write a Python script / shell script / C / Rust binary to retrieve the info for one given line in a sample.cram file, with multi-threading / multi-processing capability.
  4. Share the database and executable along with your CRAM files. Tell your colleagues that if the XA tag is needed in downstream analysis, they should run your executable to retrieve the XA tags first.
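A minimal sketch of step 2, assuming SQLite plus pysam and keying on read name rather than line number; file names and the sample id are placeholders:

    import sqlite3
    import pysam

    REF = "GRCh38.fa"       # placeholder reference path
    SAMPLE_ID = "sample1"   # placeholder sample identifier

    conn = sqlite3.connect("xa_tags.sqlite")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS xa_hits ("
        "sample_id TEXT, read_name TEXT, chr TEXT, pos INTEGER, cigar TEXT, nm INTEGER)"
    )

    rows = []
    with pysam.AlignmentFile("sample.cram", "rc", reference_filename=REF) as cram:
        for read in cram:
            if not read.has_tag("XA"):
                continue
            # XA:Z: is a ';'-terminated list of "chr,pos,CIGAR,NM" entries;
            # pos carries a leading +/- strand sign, which int() keeps as the sign.
            for hit in read.get_tag("XA").rstrip(";").split(";"):
                chrom, pos, cigar, nm = hit.split(",")
                rows.append((SAMPLE_ID, read.query_name, chrom, int(pos), cigar, int(nm)))

    conn.executemany("INSERT INTO xa_hits VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()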

It's quite easy yet somewhat labor-intensive.

Your Todo: Try to parse all XA tags from several CRAM files into fields and pool them into a dataframe:

chr,pos,CIGAR,NM

Note that there may be multiple items in a single XA tag, resulting in multiple rows. Draw a scatterplot with a fitted curve of reads_count vs. {x}, as well as another scatterplot of sample_count vs. {x}, where x could be one of the following: unique chr-pos-CIGAR-NM combinations, unique chr, unique pos, unique CIGAR, unique NM.

I think you should at least try 10 to 50 sample.cram files, starting from 1. This number actually depends on the sample heterogeneity. You can use simple random sampling / stratified random sampling / selecting samples with the greatest heterogeneity (e.g. 10 samples with 10 different types of cancer) to estimate how bad it could be.

If a converging trend is observed in the scatterplots with an acceptable database size, you should definitely do it. Moreover, you could just save the temporary dataframe as parquet.zst using pyarrow / polars and watch the trend of the file size.
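For that pooling experiment, a sketch along these lines (polars and pysam assumed, paths are placeholders) would let you watch how the compressed parquet grows as samples are added:

    import polars as pl
    import pysam

    REF = "GRCh38.fa"  # placeholder reference path

    def xa_rows(cram_path: str) -> list[dict]:
        """Explode every alternative hit of every XA tag into one row."""
        rows = []
        with pysam.AlignmentFile(cram_path, "rc", reference_filename=REF) as cram:
            for read in cram:
                if not read.has_tag("XA"):
                    continue
                for hit in read.get_tag("XA").rstrip(";").split(";"):
                    chrom, pos, cigar, nm = hit.split(",")
                    rows.append({"chr": chrom, "pos": int(pos), "CIGAR": cigar, "NM": int(nm)})
        return rows

    # Pool a handful of samples, then check the on-disk size of the zstd parquet.
    df = pl.DataFrame([r for path in ["s1.cram", "s2.cram"] for r in xa_rows(path)])
    df.write_parquet("xa_pool.parquet.zst", compression="zstd")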

Reasons:

  1. The XA tag is used by many variant callers, especially some somatic variant callers (e.g. LANCET), and by methylation counters (irrelevant here).
  2. The size of a compressed file is always about its complexity / entropy. The XA tag is structured and has a certain upper limit.

    BWA old manual:

    XA  Alternative hits; format: (chr,pos,CIGAR,NM;)*
    

    which means it can be compressed, and it's easy to infer that the benefit (compression ratio) is positively related to the sample count, and negatively related to the sample heterogeneity (race, gender, disease/health status...) and the length of the sequenced genomic regions.

ADD COMMENT
0
Entering edit mode

Thanks for your answer. Love this stream of thought! However, making it accessible would be hard for researchers and would defeat the purpose of sharing this data.

ADD REPLY
0
Entering edit mode

If you have to compromise, keeping only the 1st hit in every XA tag is also practicable.
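A sketch of that compromise, again assuming pysam and placeholder paths:

    import pysam

    REF = "GRCh38.fa"  # placeholder reference path

    with pysam.AlignmentFile("in.cram", "rc", reference_filename=REF) as fin, \
         pysam.AlignmentFile("out.cram", "wc", template=fin, reference_filename=REF) as fout:
        for read in fin:
            if read.has_tag("XA"):
                # Keep only the first "chr,pos,CIGAR,NM;" entry of the XA list.
                first_hit = read.get_tag("XA").split(";")[0]
                read.set_tag("XA", first_hit + ";")
            fout.write(read)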

ADD REPLY
