Hello,
I am trying to match Gencode's annotations to assemblies.
It is my understanding that the sequence of the reference chromosomes changes only with a major version update (e.g. GRCh37 -> GRCh38). In minor versions (such as GRCh38.p2), patches (deltas between the major version and the new minor version) may be added, as well as haplotypes, etc.
Gencode releases the following annotation: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.chr_patch_hapl_scaff.annotation.gtf.gz
that matches the following assembly: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/GRCh38.p2.genome.fa.gz
If one doesn't want the patches, one can use the primary assembly instead: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/GRCh38.primary_assembly.genome.fa.gz
which matches the following annotation: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.primary_assembly.annotation.gtf.gz
But then, what is this annotation for? ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_22/gencode.v22.annotation.gtf.gz
According to the description, this annotation describes reference chromosomes only. So why isn't this suitable for the primary assembly?
Also, it is often suggested not to mix and match Ensembl or Gencode annotations with UCSC assemblies, but given that there are 1:1 matchings (such as hg38 = GRCh38) that should be doable, as long as one takes care that chromosome names follow the same convention.
Similarly, if we remove patches and alternate loci from GRCh38.p2, at that point wouldn't we get back to the primary assembly GRCh38, except for differences in scaffolds? Then, if our annotations of choice only describe reference chromosomes, then those annotations, originally meant for GRCh38, would also work fine with GRCh38.p2. Isn't that the case?
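One way to test this kind of compatibility is to compare the sequence names used in the annotation with those present in the assembly. The sketch below is only an illustration: the function names are made up for this example, and it assumes the GTF seqname (column 1) should equal the first whitespace-delimited token of each FASTA header. It checks names only, not the sequences themselves.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def fasta_seqnames(lines):
    """Collect sequence names from FASTA headers (text up to the first whitespace)."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def gtf_seqnames(lines):
    """Collect the values of column 1 (seqname) from GTF lines, skipping comments."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def unmatched_seqnames(gtf_lines, fasta_lines):
    """Return seqnames that are annotated in the GTF but absent from the FASTA."""
    return gtf_seqnames(gtf_lines) - fasta_seqnames(fasta_lines)
```

With local copies of the files, something like `unmatched_seqnames(open_text("gencode.v22.annotation.gtf.gz"), open_text("GRCh38.primary_assembly.genome.fa.gz"))` should return an empty set if every annotated sequence exists in that assembly. A non-empty result lists the names that would need renaming or filtering.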
Thank you for your help!
Roberto
Could you expand on the "graph genome" part? I haven't seen that term before and couldn't find any other info, but it sounds like the sort of thing that would be good to know more about.
Graph genomes are a new way of representing a genome together with all the alternative sequences it can take. The current model is a linear sequence, with haplotypes shown as separate sequences on top of the genome. This creates a problem of perception: when you see a genome with sequences on top, you read the primary sequence as the "reference" and the haplotypes as "other". It also means that many analysis tools work only with the primary assembly, so the haplotypes get skipped in analyses. In fact, the haplotypes are equally relevant and important sequences that individuals may carry. Some haplotypes are characteristic of particular populations and ethnic groups, so it is important that all haplotypes, and the groups they represent, are treated equally in the genome representation.
The solution is a graph genome. Instead of being completely linear, a graph genome consists of a linear sequence that splits into different branches wherever there are alternative haplotypes. This means that all of the haplotypes become part of the primary assembly, and any analysis will include all possible sequences. Obviously, this will require a massive redesign of various tools to work with these data.
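To make the idea concrete, here is a toy sketch of such a graph. It is not how any real graph genome tool stores its data; the node IDs, the sequences and the `spell_paths` function are all invented for illustration. Each node holds a stretch of sequence, edges say which node may follow which, and every path through the graph spells one possible haplotype.

```python
# Toy graph: a shared start, two alternative branches, and a shared end.
nodes = {1: "ACGT", 2: "G", 3: "TTA", 4: "CCA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(nodes, edges, start):
    """Yield the sequence spelled by every path from `start` to a node with no outgoing edges."""
    stack = [(start, nodes[start])]
    while stack:
        node, seq = stack.pop()
        successors = edges.get(node, [])
        if not successors:
            # Reached the end of the graph: this path is one complete haplotype.
            yield seq
        for nxt in successors:
            stack.append((nxt, seq + nodes[nxt]))


haplotypes = sorted(spell_paths(nodes, edges, start=1))
print(haplotypes)  # ['ACGTGCCA', 'ACGTTTACCA']
```

Neither of the two sequences is marked as "the reference" here: both are simply paths through the same graph, which is the point of the model.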
You can find an illustrated example here: https://github.com/adamnovak/schemas/blob/master/doc/GraphModeFAQ.md