Hello, I am quite confused because, from my perspective, there were two versions of the genome assemblies: one from the Genome Reference Consortium and one from Ensembl. I only recently figured out that Ensembl probably just copies the sequence from the Genome Reference Consortium release builds and does not alter it. Thus, for example, downloading these data ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/ should give the same sequence as downloading these data ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/ The real difference is that Ensembl does some post-processing and keeps the data in sync with dbSNP and other sources of information, perhaps more visibly because it needs to use these data in its own public tools.
However, I would like to know more about the gene build and gene annotation process. What steps does it include? Who are the people behind assembling the sequence and producing the annotations? What tools do they use? What public funding supports their work? Are they performing de novo assemblies of human and other genomes, or are they only curating sequencing results produced decades ago?
I feel that we all discuss different tools for working with their data, but I really want to know more about these reference data and how they come about.
Thanks Vojtech.
The Genome Reference Consortium consists of several institutions, and EBI is part of the group: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
There are some slide decks here that describe the process of how assemblies are put together: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/workshops/
Thank you, very interesting reading.
We in Ensembl work on the annotation of assemblies; we do not produce assemblies. We need the assemblies to be publicly available so that we can import the sequence and run our analyses to annotate genes, variation data, orthologues and paralogues, and regions involved in gene regulation. We don't change the genomic sequence, but we add value to it by annotating it. A bunch of 'acgt's is meaningless without the annotation we provide. For human and mouse, we get the assemblies from GRC. In Ensembl, we've got a team behind the annotation of genes, the Ensembl Genebuild team. Detailed information on the genebuild of the human assembly can be found here. Also have a look at our papers for additional information, more specifically Ensembl 2016.
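If you want to check for yourself that two downloads carry the same sequence, one option is to compare per-record checksums. The following is only a minimal sketch, not an Ensembl or NCBI tool: the function names are made up for illustration, the two toy records are invented, and it assumes that differences in header lines, line wrapping and soft-masking (lower-case repeats) should be ignored.

```python
import hashlib
import io


def sequence_checksums(handle):
    """Return (header, md5) pairs for every record in a FASTA text stream.

    Sequence lines are upper-cased before hashing, so soft-masked
    (lower-case) bases do not count as differences.
    """
    checksums = []
    header, digest = None, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the finished one first.
            if header is not None:
                checksums.append((header, digest.hexdigest()))
            header, digest = line[1:], hashlib.md5()
        elif digest is not None:
            digest.update(line.upper().encode("ascii"))
    if header is not None:
        checksums.append((header, digest.hexdigest()))
    return checksums


def same_sequences(handle_a, handle_b):
    """True if both streams hold the same set of sequences.

    Headers, record order, line wrapping and case are ignored.
    """
    md5_a = sorted(md5 for _, md5 in sequence_checksums(handle_a))
    md5_b = sorted(md5 for _, md5 in sequence_checksums(handle_b))
    return md5_a == md5_b


# Toy records (invented): same bases, different headers, wrapping and case.
ensembl_like = io.StringIO(">1 dna:chromosome\nACGTNN\nacgt\n")
ncbi_like = io.StringIO(">NC_000001.11 chromosome 1\nACGTNNAC\nGT\n")
print(same_sequences(ensembl_like, ncbi_like))
```

For the real, gzip-compressed files you could pass `gzip.open(path, "rt")` handles instead of the `io.StringIO` objects. Note that this only compares the records present in each file, so the two downloads need to cover the same set of chromosomes for the result to be meaningful.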
Thank you for your answer. I think it is wise to use one version of the reference sequence. I suggest that the README file ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/README should state that this is the same sequence that is also known as GRCh38.p5, because not everyone knows the GCA_000001405.20 identifier or that the two are actually the same. At least for me it was difficult to realize, and up to now I have referred to it as GRCh38.ensembl84.
I appreciate the amount of work you do and I would love to see the details.
Sometimes I am a bit confused by the number of different annotations that are produced and how to decide which one is actually "better". There should definitely be some effort to somehow converge and connect results from NCBI and EMBL-EBI and other places.
I'm with you regarding your suggestion. At the moment we state that GRCh38.p5 = GCA_000001405.20 on our annotation page and when using this REST endpoint. We will check whether we can include this information in the README of the DNA FASTA files on the FTP site. If so, we will update it in forthcoming releases as new patches are incorporated into the primary assembly, e.g. GRCh38.p7 = GCA_000001405.22.
In the README, we state the assembly in that file as GCA_000001405.20. A simple web search would take you to this page on the NCBI, which gives the correspondence between GCA_000001405.20 and GRCh38.p5.
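For a scripted version of this lookup, the assembly name and accession can be read from the Ensembl REST API. The sketch below is an assumption on my side: the endpoint linked above is not spelled out in this thread, so the URL (`info/assembly/homo_sapiens`) and the JSON field names `assembly_name` and `assembly_accession` should be checked against the REST documentation, and the helper names are made up for illustration.

```python
import json
import urllib.request

# Assumption: this URL is a guess at the endpoint meant in the thread.
ASSEMBLY_INFO_URL = (
    "https://rest.ensembl.org/info/assembly/homo_sapiens"
    "?content-type=application/json"
)


def fetch_assembly_info(url=ASSEMBLY_INFO_URL):
    """Download and decode the assembly description (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def describe_assembly(info):
    """Format 'name = accession' from a decoded response.

    The field names are assumed; missing fields are reported as 'unknown'.
    """
    name = info.get("assembly_name", "unknown")
    accession = info.get("assembly_accession", "unknown")
    return f"{name} = {accession}"
```

Calling `describe_assembly(fetch_assembly_info())` should then give a line in the same `name = accession` form as the GRCh38.p5 = GCA_000001405.20 correspondence above, with the values depending on the Ensembl release that is current when you run it.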
When I said the assemblies need to be publicly available, I should have made clear that this means 'submitted to GenBank, ENA or DDBJ (aka INSDC)'. Some assemblies can be available without having been submitted to one of those databases.