Seeing as snpEff does not show any preconfigured databases for the latest Bos taurus assembly (i.e. ARS-UCD1.2), I've been trying to build one myself following the human database example from the manual:
# Go to SnpEff's install dir
cd ~/snpeff
# Create database dir
mkdir data/GRCh37.70
cd data/GRCh37.70
# Download annotated genes
wget ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz
mv Homo_sapiens.GRCh37.70.gtf.gz genes.gtf.gz
# Download proteins
# This is used for:
# - "Rare Amino Acid" annotations
# - Sanity check (checking protein predicted from DNA sequences match 'real' proteins)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.70.pep.all.fa.gz
mv Homo_sapiens.GRCh37.70.pep.all.fa.gz protein.fa.gz
# Download CDSs
# Note: This is used as "sanity check" (checking that CDSs predicted from gene sequences match 'real' CDSs)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.70.cdna.all.fa.gz
mv Homo_sapiens.GRCh37.70.cdna.all.fa.gz cds.fa.gz
# Download regulatory annotations
wget ftp://ftp.ensembl.org/pub/release-70/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
mv AnnotatedFeatures.gff.gz regulation.gff.gz
# Uncompress
gunzip *.gz
# Download genome
cd ../genomes/
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz GRCh37.70.fa.gz
# Uncompress:
# Why do we need to uncompress?
# Because ENSEMBL compresses files using a block-compressed gzip, which is not compatible with Java's Gunzip library
gunzip GRCh37.70.fa.gz
# Edit snpEff.config file
#
# WARNING! You must do this yourself. Just copying and pasting this into a terminal won't work.
#
# Add lines:
# GRCh37.70.genome : Homo_sapiens
# GRCh37.70.reference : ftp://ftp.ensembl.org/pub/release-70/gtf/
# Now we are ready to build the database
cd ~/snpeff
java -Xmx20g -jar snpEff.jar build -v GRCh37.70 2>&1 | tee GRCh37.70.build
I've skipped regulatory annotations (because I don't really need those) and renamed the assembly file as sequences.fa, because, contrary to instructions, snpEff.jar build will not recognise the assembly otherwise (the logs are quite explicit about it). Here is some tree output:
${PWD}/snpEff/data/ARS-UCD1.2/
├── cds.fa -> ${PWD}/ref/ars/cdna/Bos_taurus.ARS-UCD1.2.cdna.all.fa
├── genes.gtf -> ${PWD}/ref/ars/anno/Bos_taurus.ARS-UCD1.2.97.gtf
├── protein.fa -> ${PWD}/ref/ars/pep/Bos_taurus.ARS-UCD1.2.pep.all.fa
├── sequences.fa -> ${PWD}/ref/ars/asm/Bos_taurus.ARS-UCD1.2.dna_sm.toplevel.fna
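For reference, the layout above amounts to four symlinks. A minimal sketch, assuming the reference files sit under ./ref/ars exactly as in the tree output (the variable names ref and db are only for illustration):

```shell
# Assumption: run from the project root that contains both ref/ and snpEff/
ref="${PWD}/ref/ars"
db="${PWD}/snpEff/data/ARS-UCD1.2"

# Create the database directory and link the inputs under the names snpEff expects
mkdir -p "$db"
ln -sf "$ref/cdna/Bos_taurus.ARS-UCD1.2.cdna.all.fa"        "$db/cds.fa"
ln -sf "$ref/anno/Bos_taurus.ARS-UCD1.2.97.gtf"             "$db/genes.gtf"
ln -sf "$ref/pep/Bos_taurus.ARS-UCD1.2.pep.all.fa"          "$db/protein.fa"
ln -sf "$ref/asm/Bos_taurus.ARS-UCD1.2.dna_sm.toplevel.fna" "$db/sequences.fa"
```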
Here are the lines I've added to the config file:
# Bos taurus (ARS-UCD1.2)
ARS-UCD1.2.genome : Bos_taurus (ARS-UCD1.2)
ARS-UCD1.2.MT.codonTable : Vertebrate_Mitochondrial
When I run java -Xmx20g -jar snpEff.jar build -v ARS-UCD1.2 2>&1 | tee ARS-UCD1.2.build, everything seems to work fine, except that snpEff is unable to find a single protein during the final protein check. Here is the tail of the log:
....................................................................................................
....................................................................................................
.......................................
Protein check: ARS-UCD1.2 OK: 0 Not found: 37538 Errors: 0 Error percentage: NaN%
00:01:02 Saving database
00:01:26 [Optional] Reading regulation elements: GFF
00:01:26 Warning: Cannot read optional regulation file '/home/ilia/projects/ksi/selection/snpEff/./data/ARS-UCD1.2/regulation.gff', nothing done.
00:01:26 [Optional] Reading regulation elements: BED
00:01:26 Cannot find optional regulation dir '/home/ilia/projects/ksi/selection/snpEff/./data/ARS-UCD1.2/regulation.bed/', nothing done.
00:01:26 [Optional] Reading motifs: GFF
00:01:26 Warning: Cannot open PWMs file /home/ilia/projects/ksi/selection/snpEff/./data/ARS-UCD1.2/pwms.bin. Nothing done
00:01:26 Done
00:01:26 Logging
00:01:27 Checking for updates...
NEW VERSION!
There is a new SnpEff version available:
Version : 4.4
Release date : 2019-01-26
Download URL : http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
00:01:28 Done.
For comparison, the CDS check looks fine:
CDS check: UMD3.1 OK: 22914 Warnings: 5165 Not found: 3826 Errors: 0 Error percentage: 0.0%
00:00:39 done
I have no idea why this is happening, and Google hasn't been particularly helpful either. I am not even sure whether I should be worried about this check, because snpEff itself does not seem to consider it critical enough to abort the process.
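One thing that can at least be checked by hand is whether the IDs in the protein.fa headers correspond to the transcript IDs in genes.gtf. The sketch below counts the overlap in two ways: using the first token of each FASTA header, and using the "transcript:" field that Ensembl peptide headers normally carry. It assumes Ensembl-style files (transcript_id "..." attributes in the GTF, ".N" version suffixes on FASTA IDs). That snpEff matches proteins on exactly these IDs is my assumption, not something the log confirms. The function names are made up for this example.

```shell
# List the unique transcript_id values found in a GTF file, one per line.
gtf_transcript_ids() {
    grep -v '^#' "$1" | grep -o 'transcript_id "[^"]*"' | cut -d'"' -f2 | sort -u
}

# List unique IDs taken from FASTA headers, with any ".N" version suffix removed.
#   mode "first":      the first whitespace-delimited token of the header
#   mode "transcript": the value of the "transcript:" field in the header
fasta_ids() {
    case "$2" in
        first)      grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//' ;;
        transcript) grep '^>' "$1" | grep -o 'transcript:[^[:space:]]*' | cut -d: -f2 ;;
    esac | sed 's/\.[0-9][0-9]*$//' | sort -u
}

# Print how many IDs occur both in the GTF and in the FASTA headers.
# Usage: count_shared genes.gtf protein.fa first|transcript
count_shared() {
    gtf_list=$(mktemp)
    fa_list=$(mktemp)
    gtf_transcript_ids "$1" > "$gtf_list"
    fasta_ids "$2" "$3" > "$fa_list"
    comm -12 "$gtf_list" "$fa_list" | wc -l | tr -d ' '
    rm -f "$gtf_list" "$fa_list"
}
```

Running count_shared data/ARS-UCD1.2/genes.gtf data/ARS-UCD1.2/protein.fa first and then the same command with transcript should show which kind of ID, if any, the two files have in common.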
P.S.
I also have no idea why the log reports an available update, because I've literally downloaded the very same release.