As part of the ENCODE pilot project, 44 regions representing about 1% of the human genome were selected for a community annotation effort. This paper describes how the community took part in a sort of annotation contest and how the submitted annotations were compared against a high-quality reference: annotations generated by GENCODE with extensive manual curation and experimental validation.
It has been less than 5 years since this was published, and yet I am having serious trouble tracking down the data associated with this pilot project. The paper refers the reader to this website, but I cannot find the data there. The page says it is no longer maintained and that the project has moved to the Sanger. So I poked around on Sanger's website for a while and finally found a link to this website, which seems to be the current home for GENCODE. Unfortunately, all the links I've followed (the Sanger FTP site and the UCSC ENCODE page) have taken me to data for the current (production) phase, not the pilot phase, which is what I'm looking for. I got my hopes up momentarily when I saw a link on the UCSC ENCODE page for "Pilot Project." The page described exactly what I was looking for, but I cannot find anywhere on it to download the data.
Can anyone point me to where I can download these data? In particular, I am looking for:
- Nucleotide sequences for the 44 genomic regions released during the ENCODE pilot project and as part of the EGASP project/competition
- The GENCODE annotations for these 44 regions that were used as a standard reference in the EGASP project/competition (hopefully in some common text tab-delimited format...or a format with at least some documentation)
Thanks!
Edit: After some suggestions and poking around, I was able to find the data I was looking for. The nucleotide sequences are at ftp://genome.imim.es/pub/projects/gencode/data/seqs/44_ENCODE_regions_NCBI35.fa and annotations for protein-coding genes can be found at ftp://genome.imim.es/pub/projects/gencode/data/havana-encode/version00.2_29apr05/ENCODE_coord/genes_with_cds/44regions_coding.gff.gz.
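For anyone who ends up with the same files, here is a minimal Python sketch for a first look at the annotation file once it has been downloaded. It assumes the file follows the usual tab-delimited GFF layout (seqname, source, feature, start, end, score, strand, frame, attributes); I have not confirmed that every line in this particular file does, so treat the column names as assumptions. The function names and the local file name in the usage note are my own, for illustration only.

```python
import gzip
from collections import Counter


def parse_gff_line(line):
    """Parse one GFF line into a dict; return None for comments, blanks, or short lines.

    Assumes the conventional tab-delimited GFF column order.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        return None
    return {
        "seqname": fields[0],
        "source": fields[1],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "score": fields[5],
        "strand": fields[6],
        "frame": fields[7],
        # The attributes column is optional in older GFF files.
        "attributes": fields[8] if len(fields) > 8 else "",
    }


def count_features(lines):
    """Count records per (seqname, feature type) over an iterable of GFF lines."""
    counts = Counter()
    for line in lines:
        record = parse_gff_line(line)
        if record is not None:
            counts[(record["seqname"], record["feature"])] += 1
    return counts


def count_features_in_file(path):
    """Same as count_features, reading from a gzip-compressed GFF file."""
    with gzip.open(path, "rt") as handle:
        return count_features(handle)
```

With the file saved locally as `44regions_coding.gff.gz`, calling `count_features_in_file("44regions_coding.gff.gz")` should give a quick per-region tally of feature types, which is a handy sanity check that all 44 regions are present before doing anything more involved.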
@Bert This was indeed very close. I poked around in these directories and was able to find what I'm looking for. Thanks!