Custom database in snpEff (Bos taurus ARS-UCD1.2)
5.4 years ago
Eli Korvigo ▴ 230

Seeing as snpEff does not show any preconfigured databases for the latest Bos taurus assembly (i.e. ARS-UCD1.2), I've been trying to build one myself following the human database example from the manual:

# Go to SnpEff's install dir
cd ~/snpeff

# Create database dir
mkdir data/GRCh37.70
cd data/GRCh37.70

# Download annotated genes
wget ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens/Homo_sapiens.GRCh37.70.gtf.gz
mv Homo_sapiens.GRCh37.70.gtf.gz genes.gtf.gz

# Download proteins 
# This is used for:
#   - "Rare Amino Acid" annotations
#   - Sanity check (checking protein predicted from DNA sequences match 'real' proteins)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/pep/Homo_sapiens.GRCh37.70.pep.all.fa.gz
mv Homo_sapiens.GRCh37.70.pep.all.fa.gz protein.fa.gz

# Download CDSs
# Note: This is used as a "sanity check" (checking that CDSs predicted from gene sequences match 'real' CDSs)
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.70.cdna.all.fa.gz
mv Homo_sapiens.GRCh37.70.cdna.all.fa.gz cds.fa.gz

# Download regulatory annotations
wget ftp://ftp.ensembl.org/pub/release-70/regulation/homo_sapiens/AnnotatedFeatures.gff.gz
mv AnnotatedFeatures.gff.gz regulation.gff.gz

# Uncompress
gunzip *.gz

# Download genome
cd ../genomes/
wget ftp://ftp.ensembl.org/pub/release-70/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz
mv Homo_sapiens.GRCh37.70.dna.toplevel.fa.gz GRCh37.70.fa.gz

# Uncompress: 
# Why do we need to uncompress? 
# Because ENSEMBL compresses files using a block-compressed gzip, which is not compatible with Java's gunzip library
gunzip GRCh37.70.fa.gz

# Edit snpEff.config file
#
# WARNING! You must do this yourself. Just copying and pasting this into a terminal won't work.
#
# Add lines:
#       GRCh37.70.genome : Homo_sapiens
#       GRCh37.70.reference : ftp://ftp.ensembl.org/pub/release-70/gtf/

# Now we are ready to build the database
cd ~/snpeff
java -Xmx20g -jar snpEff.jar build -v GRCh37.70 2>&1 | tee GRCh37.70.build

I've skipped the regulatory annotations (I don't really need them) and renamed the assembly file to sequences.fa because, contrary to the instructions, snpEff.jar build will not recognise the assembly otherwise (the logs are quite explicit about it). Here is the tree output for the resulting database directory:

${PWD}/snpEff/data/ARS-UCD1.2/
├── cds.fa -> ${PWD}/ref/ars/cdna/Bos_taurus.ARS-UCD1.2.cdna.all.fa
├── genes.gtf -> ${PWD}/ref/ars/anno/Bos_taurus.ARS-UCD1.2.97.gtf
├── protein.fa -> ${PWD}/ref/ars/pep/Bos_taurus.ARS-UCD1.2.pep.all.fa
├── sequences.fa -> ${PWD}/ref/ars/asm/Bos_taurus.ARS-UCD1.2.dna_sm.toplevel.fna
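
For reference, a minimal sketch of how this layout can be recreated with symlinks, assuming the Ensembl files have already been downloaded and uncompressed under a ref/ars directory with the anno, asm, cdna and pep subdirectories shown above. The function name link_snpeff_inputs is hypothetical and only used for illustration:

```shell
# Link the downloaded Ensembl files into a snpEff data directory under
# the file names used in the listing above.
# Usage: link_snpeff_inputs REF_DIR DATA_DIR
link_snpeff_inputs() {
    ref_dir=$1
    data_dir=$2
    mkdir -p "$data_dir"
    ln -sf "$ref_dir/cdna/Bos_taurus.ARS-UCD1.2.cdna.all.fa"         "$data_dir/cds.fa"
    ln -sf "$ref_dir/anno/Bos_taurus.ARS-UCD1.2.97.gtf"              "$data_dir/genes.gtf"
    ln -sf "$ref_dir/pep/Bos_taurus.ARS-UCD1.2.pep.all.fa"           "$data_dir/protein.fa"
    ln -sf "$ref_dir/asm/Bos_taurus.ARS-UCD1.2.dna_sm.toplevel.fna"  "$data_dir/sequences.fa"
}

# Example call (paths taken from the listing above):
# link_snpeff_inputs "${PWD}/ref/ars" "${PWD}/snpEff/data/ARS-UCD1.2"
```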

Here are the lines I've added to the config file:

# Bos taurus (ARS-UCD1.2)
ARS-UCD1.2.genome : Bos_taurus (ARS-UCD1.2)
        ARS-UCD1.2.MT.codonTable : Vertebrate_Mitochondrial
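
A small sketch of appending these lines to the config file from the shell, in case it helps to reproduce the setup. The function name add_ars_ucd_to_config is hypothetical, and the guard against adding the entry twice is my own addition rather than anything snpEff requires:

```shell
# Append the ARS-UCD1.2 genome entry to a snpEff config file,
# unless an ARS-UCD1.2.genome line is already present.
# Usage: add_ars_ucd_to_config CONFIG_FILE
add_ars_ucd_to_config() {
    config=$1
    if grep -q '^ARS-UCD1\.2\.genome' "$config"; then
        return 0
    fi
    cat >> "$config" <<'EOF'

# Bos taurus (ARS-UCD1.2)
ARS-UCD1.2.genome : Bos_taurus (ARS-UCD1.2)
        ARS-UCD1.2.MT.codonTable : Vertebrate_Mitochondrial
EOF
}

# Example call:
# add_ars_ucd_to_config snpEff.config
```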

When I run java -Xmx20g -jar snpEff.jar build -v ARS-UCD1.2 2>&1 | tee ARS-UCD1.2.build, everything seems to work fine, except that snpEff fails to match a single protein during the final protein check (OK: 0, Not found: 37538). Here is the tail of the log:

        ....................................................................................................
        ....................................................................................................
        .......................................

        Protein check:  ARS-UCD1.2      OK: 0   Not found: 37538        Errors: 0       Error percentage: NaN%
00:01:02        Saving database
00:01:26        [Optional] Reading regulation elements: GFF
00:01:26        Warning: Cannot read optional regulation file '/home/ilia/projects/ksi/selection/snpEff/./data/ARS-UCD1.2/regulation.gff', nothing done.
00:01:26        [Optional] Reading regulation elements: BED 
00:01:26        Cannot find optional regulation dir '/home/ilia/projects/ksi/selection/snpEff/./data/ARS-UCD1.2/regulation.bed/', nothing done.
00:01:26        [Optional] Reading motifs: GFF
00:01:26        Warning: Cannot open PWMs file /home/ilia/projects/ksi/selection/snpEff/./data/ARS-UCD1.2/pwms.bin. Nothing done
00:01:26        Done
00:01:26        Logging
00:01:27        Checking for updates...


NEW VERSION!
        There is a new SnpEff version available: 
                Version      : 4.4
                Release date : 2019-01-26
                Download URL : http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip

00:01:28        Done.

For comparison, the CDS check is fine:

        CDS check:      UMD3.1  OK: 22914       Warnings: 5165  Not found: 3826 Errors: 0       Error percentage: 0.0%
00:00:39        done
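
Since the build output was saved with tee, both summary lines can be pulled back out of the log for a side-by-side look. This is only a convenience sketch; the function name show_check_summary is hypothetical, and the pattern simply matches the "CDS check:" and "Protein check:" labels seen in the excerpts above:

```shell
# Print only the CDS and protein check summary lines from a build log.
# Usage: show_check_summary BUILD_LOG
show_check_summary() {
    grep -E '(CDS|Protein) check:' "$1"
}

# Example call (log name taken from the tee command above):
# show_check_summary ARS-UCD1.2.build
```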

I have no idea why this is happening, and Google hasn't been particularly helpful either. I am not even sure I should be worried about this check, because snpEff itself does not seem to consider it critical enough to abort the build.
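
One thing that may be worth ruling out is an ID mismatch between protein.fa and genes.gtf. This is purely an assumption: nothing in the log confirms that the protein check looks records up by transcript ID. Ensembl pep headers start with a protein ID (ENSBTAP...), whereas the cdna headers start with a transcript ID (ENSBTAT...), which would at least be consistent with the CDS check finding matches while the protein check finds none. The sketch below counts how many FASTA record IDs also occur as transcript_id values in the GTF; all three function names are hypothetical:

```shell
# Print the ID (first word of each header line) of every FASTA record.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Print every transcript_id value found in a GTF file.
gtf_transcript_ids() {
    grep -o 'transcript_id "[^"]*"' "$1" | cut -d'"' -f2 | sort -u
}

# Count the IDs that occur both as FASTA record IDs and as GTF transcript IDs.
# Usage: count_shared_ids FASTA GTF
count_shared_ids() {
    _ids_fa=$(mktemp)
    _ids_gtf=$(mktemp)
    fasta_ids "$1" > "$_ids_fa"
    gtf_transcript_ids "$2" > "$_ids_gtf"
    comm -12 "$_ids_fa" "$_ids_gtf" | wc -l | tr -d ' '
    rm -f "$_ids_fa" "$_ids_gtf"
}

# Example calls, run from the database directory:
# count_shared_ids protein.fa genes.gtf
# count_shared_ids cds.fa genes.gtf
```

The comparison is exact, so an ID with a version suffix in the FASTA header (for example a trailing .1) will not match a GTF ID without one. A count of zero for protein.fa would only show that the IDs differ, not that this is what the protein check trips over.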

P.S.

I also have no idea why the log reports an update, because I downloaded that very same release.

SNP annotation snpEff • 2.2k views