Forum: How should reference genome fasta files be distributed by UCSC?
21 months ago

Dear genomics-tools-users and Istvan Albert

I work at UCSC and have a question on how to weigh consistency of links versus data updates.

TLDR: Should UCSC change the main {hg19,hg38}.fa.gz when the GRC releases a new patch?

Traditionally, the hg19 and hg38 fasta files were distributed at https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ and https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

Because big fasta files don't go into Git, there are a lot of genomics pipeline scripts that start with a wget of one of these files: https://github.com/search?q=%22hgdownload%22+%22goldenpath%22+%22ucsc%22+%22hg19.fa.gz%22+%22bigZips%22&type=code

However, the GRC started to add "patches" to these genomes around a decade ago, so we had to update our reference fasta files. We thought that if we changed the old file, a lot of pipelines out there (unrelated to genome browsers) might suddenly output different results, so we decided to put the updated "patched" genome file into https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/ (note the addition of the "latest/" directory; the same applies to hg19).

However, now the files in the "bigZips" directory don't correspond to what's shown in the genome browser anymore. Anyone who is used to our directory structure will stumble over this sooner or later: they have to discover the "latest" directory first, or otherwise they will run into chrom names that are not in the chrom.sizes file, or sequences that they cannot find in what they think is the reference genome fasta. With more and more patches, this is becoming more of an issue.

What do you think is more important: to keep the fasta file stable for the larger genomics ecosystem (does anyone actually care about this fasta file that much?) and not touch a reference genome once released, or to keep our directory structure consistent for genome browser users and just overwrite the main fasta file with every patch update?

reference-genome fasta freeze ucsc-genome-browser • 2.7k views
ADD COMMENT
21 months ago

However, the GRC started to add "patches" to these genomes around a decade ago so we had to update our reference fasta files

Oh no! I don't think you're supposed to do that. A freeze is a freeze.

ADD COMMENT

Patch "updates" do not change the chromosomes. Patches simply add the _fix and _alt sequences to the genome reference.

$ diff hg38.chrom.sizes.orig hg38.chrom.sizes.latest
24a25,26
> chr8_KZ208915v1_fix   6367528
> chr15_ML143371v1_fix  5500449
25a28
> chr15_KN538374v1_fix  4998962
34a38,40
> chr15_KQ031389v1_alt  2365364
> chr5_MU273354v1_fix   2101585
> chr16_KV880768v1_fix  1927115
37a44
> chr1_MU273333v1_fix   1572686
40a48
> chr15_MU273374v1_fix  1154574
46a55,56
> chr12_KZ208916v1_fix  1046838
> chr21_MU273391v1_fix  1020778
50a61
> chr2_MU273342v1_fix   955087
52a64,65
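
The same comparison can be scripted. The following is a minimal sketch, assuming chrom.sizes files are whitespace-separated "name size" lines; the function names `read_chrom_sizes` and `summarize_patch` are made up for this illustration, and the inputs are toy text (one chr1 row plus two rows taken from the diff above), not the real files. It lists which sequences were added and checks that no existing sequence changed size.

```python
def read_chrom_sizes(text):
    """Parse chrom.sizes text (one 'name size' pair per line) into a dict."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def summarize_patch(original, latest):
    """Group sequences added in `latest` by suffix and list any size changes."""
    added = {"fix": [], "alt": [], "other": []}
    for name in latest:
        if name in original:
            continue
        if name.endswith("_fix"):
            added["fix"].append(name)
        elif name.endswith("_alt"):
            added["alt"].append(name)
        else:
            added["other"].append(name)
    # Sequences present in both files whose length differs (expected: none).
    changed = [n for n in original if n in latest and original[n] != latest[n]]
    return added, changed


# Toy inputs for illustration only; real files have many more rows.
ORIG_TEXT = "chr1\t248956422\n"
LATEST_TEXT = (
    "chr1\t248956422\n"
    "chr8_KZ208915v1_fix\t6367528\n"
    "chr15_KQ031389v1_alt\t2365364\n"
)

added, changed = summarize_patch(
    read_chrom_sizes(ORIG_TEXT), read_chrom_sizes(LATEST_TEXT)
)
print(added)
print(changed)
```

An empty `changed` list is what one would expect if patches only add sequences and leave the primary chromosomes untouched.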
ADD REPLY

I think 99.999% of users do not want to deal with patches. This is very esoteric stuff.

ADD REPLY

That is what I think, too, but you're only one voice, and that makes just two of us who think this. This thread has attracted only three users. I guess there is no way to answer the question of how important these darn patches are. Someone may just have to make a decision.

ADD REPLY

Example: over the last two days, the reference genome has been downloaded around 400 times. Only 180 of these were downloads from the "latest" directory; the others were the initial genome version, without the patches and fixes. But were these intentional? Maybe people who download the "latest" version don't actually care about the fixes and just go to the default link. Maybe the people who download the "initial" version just follow some incoming external link or copy/paste from an old script. Or maybe they are not people but scripts, and if we change the .fa.gz file now, this would break a lot of scripts out there.

ADD REPLY

Either way, you are the third user here who doesn't care about the patches. Thank you for your feedback and opinion. We'll leave the situation as it is.

ADD REPLY
21 months ago
GenoMax 147k

It would be difficult to get a consensus since casual users may only care about a "genome reference" without worrying about the patches. I don't see people going to UCSC as their main source of reference genomes (at least here) so this may mainly impact genome browser users.

However, now the files in the "bigZips" directory don't correspond to what's shown in the genome browser anymore. Anyone who is used to our directory structure will stumble over this sooner or later: they have to discover the "latest" directory first, or otherwise they will run into chrom names that are not in the chrom.sizes file, or sequences that they cannot find in what they think is the reference genome fasta. With more and more patches, this is becoming more of an issue.

This is an important point and thank you for making a note of this. There is a set of tools that make use of the chrom.sizes file so not using the exact set of files is going to cause problems with those tools.
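
A mismatch of this kind can be caught before running any tool. Here is a small sketch, assuming FASTA header lines start with ">" followed by the sequence name and that chrom.sizes has the name in the first column; the helper names are hypothetical and the inputs are tiny in-memory stand-ins, not the real UCSC files.

```python
import io


def fasta_names(lines):
    """Yield the sequence name from each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def chrom_size_names(lines):
    """Yield the first column of each non-empty chrom.sizes line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0]


def compare_names(fasta_lines, sizes_lines):
    """Return (names only in the FASTA, names only in chrom.sizes)."""
    in_fasta = set(fasta_names(fasta_lines))
    in_sizes = set(chrom_size_names(sizes_lines))
    return sorted(in_fasta - in_sizes), sorted(in_sizes - in_fasta)


# Toy case: an unpatched FASTA paired with a patched chrom.sizes file.
fasta = io.StringIO(">chr1\nACGT\n")
sizes = io.StringIO("chr1\t4\nchr8_KZ208915v1_fix\t6367528\n")

only_in_fasta, only_in_sizes = compare_names(fasta, sizes)
print("only in FASTA:", only_in_fasta)
print("only in chrom.sizes:", only_in_sizes)
```

Both functions accept any iterable of text lines, so a compressed file can be passed in via `gzip.open(path, "rt")`. Two empty lists mean the files describe the same set of sequences.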

our directory structure consistent for genome browser users and just overwrite the main fasta file with every patch update?

All things considered, this may be the way to go. You could include a link to the original bigZips folder in the README (that is, make the latest patched fasta the default and keep the original genome release as an archive). You should also include a link on the browser page to the new directory, so people can grab the exact same version of the fasta that is visible in the genome browser, and note that it is always the latest patched version.

ADD COMMENT

I don't see people going to UCSC as their main source of reference genomes (at least here) so this may mainly impact genome browser users.

I included a link to a GitHub search in my question above, where you can see that a lot of software uses the hg19 files from UCSC, and a lot of that software has little to do with the genome browser, so no, I don't agree with this statement. I think many people are getting the reference genome from UCSC; I don't know why.

Where do you get your reference genome from? NCBI? If so, by default, their file has NC_xxxxx chromosome names, which is not very useful for your daily work. Or are you using the NCBI "for analysis" fasta file? Do you get your reference genome elsewhere?

ADD REPLY

I see many questions on Biostars about "top level/primary" genome files from Ensembl but I don't think I have seen a question about bigZips/latest folders. Perhaps people find the names self-explanatory and they pay attention to the folder dates they see on these folders.

Many casual users likely get their genome files/aligner index bundle from resources such as iGenomes. People have been going to GENCODE as well.

Going back to your original post, I don't fully understand this statement:

However, now the files in the "bigZips" directory don't correspond to what's shown in the genome browser anymore.

Main chromosomes have never changed compared to the original release so what exactly is different (or a concern) in genome browser? As discussed elsewhere few people likely care about genome patches or pay attention to them.

ADD REPLY

Yes, I think you're right, casual users get their genomes from wherever they get the aligner from. So this problem affects only a few people.

Main chromosomes have never changed compared to the original release so what exactly is different (or a concern) in genome browser? As discussed elsewhere few people likely care about genome patches or pay attention to them.

These files do not correspond to the genome browser's genome anymore. Yes, this may be a very minor deal, I agree. Apparently no one really cares about this. Which is great, it makes our lives easier.

Many thanks for your comments!

ADD REPLY
21 months ago

In my opinion, data is software (and a better way to put it is that software is data).

Hence all rules of versioning, releases, etc, that the community adopted for software should also be adopted for data.

With that assumption in place, any ambiguity can be resolved if we ask ourselves, if data X were really software X, what would be the proper course of action?

ADD COMMENT

I agree with the reasoning, but since the releases are controlled/done by other entities (more or less one unique authority for each model organism), it would be more confusing to add any kind of versioning system on top of the patch designations etc. There is probably little chance of getting all groups that control genome releases/patches to agree on a common nomenclature.

ADD REPLY

Hence all rules of versioning, releases, etc, that the community adopted for software should also be adopted for data.

Of course, which is why the genomes have a version number. But they didn't use to have one, so there is one "initial" file without a version number. This was my question: should we replace this "initial" file? If you think "sure, just update everything", then maybe you don't have scripts on your hard disk that start with a wget request of the reference genome and have no checks afterwards, and you have never run into genomics pipeline output files where results on _alt and _fix chromosomes suddenly appear after someone at UCSC updated a genome file that, back when you wrote the script, you probably thought would never change.... ? :-)

Again: data files are not exactly like software. Genomics files are not in git, they're too big. They get downloaded by the software which means that the URL of the data files may matter.
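
One way a pipeline script can protect itself regardless of what happens at the URL is to pin a checksum of the file it was developed against. This is only a sketch under stated assumptions: `fetch_reference`, `verify_reference`, and `sha256_of` are hypothetical helper names, and the expected digest is something the script author would record themselves after a trusted download; it is not a value taken from this thread.

```python
import hashlib
import urllib.request


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_reference(path, expected_sha256):
    """Raise ValueError if the file does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(
            f"{path}: expected sha256 {expected_sha256}, got {actual}"
        )
    return actual


def fetch_reference(url, dest, expected_sha256):
    """Download `url` to `dest`, then fail loudly if the content changed."""
    urllib.request.urlretrieve(url, dest)
    return verify_reference(dest, expected_sha256)
```

With a check like this, a silently replaced fasta file stops the pipeline with a clear error instead of producing different results.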

Discussing this on Biostars is very interesting, thanks, but: I think I am unable to explain the problem well and also, it may not matter that much if no one is yelling here. :-)

ADD REPLY

The similarity I alluded to corresponds to how we use data: we use it like software.

When we download software from its default location, what version do we really want? Do you want an esoteric outdated version or the latest one?

The same is true with data. When we run any analysis, we should get the best and most reliable data release.

If our workflow critically depends on data being one specific version at the risk of it being outdated, then it is incumbent on the requester to obtain the correct version of the data. The provider's responsibility is only to provide a link that is properly tagged to that release.

So the solution, in my mind, is very simple. The default UCSC data should link to the latest release, and should be tagged as "latest". Other releases should be tagged accordingly and not change.

ADD REPLY

Yes, of course, I understand. We have "latest/" and "initial" subdirectories now. The problem here is that the original link had no notion of "latest", because in 2013, we had never been in this position before. So it's a matter of opinion what these old links should point to.

It sounds like this is a minor issue for the people here and I'm just overly nervous about changing a file. I think we'll leave it as-is for now and see if someone runs into trouble and complains. Maybe someone will find this thread in 10 years and comment here then. :-)

Many thanks for your comments!

ADD REPLY
