Dear genomics-tools-users and Istvan Albert
I work at UCSC and have a question on how to weigh consistency of links versus data updates.
TLDR: Should UCSC change the main {hg19,hg38}.fa.gz when the GRC releases a new patch?
Traditionally, the hg19 and hg38 fasta files were distributed at https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ and https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
Because big fasta files don't go into Git, there are a lot of genomics pipeline scripts that start with a wget of one of these files: https://github.com/search?q=%22hgdownload%22+%22goldenpath%22+%22ucsc%22+%22hg19.fa.gz%22+%22bigZips%22&type=code
However, the GRC started to add "patches" to these genomes around a decade ago so we had to update our reference fasta files. We thought that if we change the old file, a lot of pipelines out there (unrelated to genome browsers) may suddenly output different results, so we decided to put the updated "patched" genome file into https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/ (note the addition of the "latest/" directory, same for hg19)
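For pipelines that worry about the file changing underneath them, one option is to record a checksum once and check it after every download. Here is a minimal Python sketch, not an official UCSC recommendation. The local path and the expected digest are placeholders you would supply yourself, and the function names are made up for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_reference(path, expected_sha256):
    """Raise ValueError if the file does not match the digest recorded earlier."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"{path}: checksum {actual} does not match recorded {expected_sha256}"
        )
    return actual


# Hypothetical usage: the path and digest are placeholders, not real values.
# verify_reference("hg38.fa.gz", "<digest recorded when the pipeline was set up>")
```

With a check like this, a silently replaced fasta file fails loudly at the start of the pipeline instead of changing results downstream.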
However, the files in the "bigZips" directory no longer correspond to what is shown in the genome browser. Anyone who is used to our directory structure will stumble over this sooner or later: they have to discover the "latest" directory first, or they will run into chrom names that are not in the chrom.sizes file, or sequences that they cannot find in what they think is the reference genome fasta. As the number of patches grows, this is becoming more of an issue.
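To see whether a given fasta and chrom.sizes pair actually agree, something like the following Python sketch could be used. It assumes the usual layouts: fasta headers that start with ">" followed by the sequence name, and a tab-separated chrom.sizes file with the name in the first column. The function names and the local file names in the usage comment are hypothetical.

```python
import gzip


def fasta_names(lines):
    """Yield sequence names: the text after '>' up to the first whitespace."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                yield parts[0]


def chrom_sizes_names(lines):
    """Yield names from a two-column, tab-separated chrom.sizes file."""
    for line in lines:
        line = line.strip()
        if line:
            yield line.split("\t")[0]


def compare_names(fasta_lines, sizes_lines):
    """Return (names only in chrom.sizes, names only in the fasta), both sorted."""
    in_fasta = set(fasta_names(fasta_lines))
    in_sizes = set(chrom_sizes_names(sizes_lines))
    return sorted(in_sizes - in_fasta), sorted(in_fasta - in_sizes)


# Hypothetical usage with local copies of the two files:
# with gzip.open("hg38.fa.gz", "rt") as fa, open("hg38.chrom.sizes") as cs:
#     only_in_sizes, only_in_fasta = compare_names(fa, cs)
```

Two empty lists mean the files describe the same set of sequences. Anything else points to a mismatch of the kind described above.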
What do you think is more important: (1) keeping the fasta file stable for the larger genomics ecosystem (does anyone actually care about this fasta file that much?) and not touching a reference genome once it is released, or (2) keeping our directory structure consistent for genome browser users and simply overwriting the main fasta file with every patch update?
Patch "updates" are not changing the chromosomes. Patches simply add the _fix and _alt files to the genome reference.
I think 99.999% of users do not want to deal with patches. This is very esoteric stuff.
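For anyone who ends up with a patched fasta but does not want the extra sequences, a name-based filter is one way to drop them. This is a rough Python sketch that assumes the sequences to drop can be recognised purely by a "_fix" or "_alt" suffix on the sequence name. The function name and the file names in the usage comment are made up. Note that a suffix filter cannot tell which release a sequence was added in, so it removes every sequence with a matching name.

```python
import gzip


def drop_suffixed(lines, suffixes=("_fix", "_alt")):
    """Yield fasta lines, skipping records whose name ends with one of the suffixes."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # The name is the header text up to the first whitespace.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            keep = not name.endswith(tuple(suffixes))
        if keep:
            yield line


# Hypothetical usage with local file names:
# with gzip.open("hg38.fa.gz", "rt") as src, open("hg38.filtered.fa", "w") as dst:
#     dst.writelines(drop_suffixed(src))
```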
That is what I think, too, but you're only one voice, and that makes just two of us. This thread has attracted only three users. I guess there is no way to answer the question of how important these darn patches are. Someone may just have to make a decision.
Example: over the last two days, the reference genome has been downloaded around 400 times. Only 180 of these downloads were from the "latest" directory; the others were of the initial genome version, without the patches and fixes. But were these choices intentional? Maybe people who download the "latest" version don't actually care about the fixes and just go to the default link. Maybe the people who download the "initial" version just follow some incoming external link or copy/paste from an old script. Or maybe they are not people but scripts, and if we change the .fa.gz file now, this would break a lot of scripts out there.
Either way, you are the third user here who doesn't care about the patches. Thank you for your feedback and opinion. We'll leave the situation as it is.