I'm building a snpEff db for the GRCh38 patch 13 (p13) RefSeq assembly. The latest pre-built snpEff db available for RefSeq is patch 7 (p7), and I need p13 for consistency within my pipelines. (Don't ask me why p13 isn't available in the pre-built snpEff library; they recommend using Ensembl, not RefSeq.) I'm on snpEff version 4.3t, the latest.
I'm able to follow the documentation and build a db from the RefSeq GTF and FASTA files from NCBI. However, there are still problems:
(sorry, I deleted a portion of this question because I figured out I was running the command on another server and the config wasn't synced)
Entries from my snpEff.config file:
#data.dir = ./data/
data.dir = /var/references/snpEff/
...
# GRCh38 current release from NCBI's RefSeq should be p13 not p7
GRCh38.p13.RefSeq.genome : Human genome GRCh38 using RefSeq transcripts
#GRCh38.p13.RefSeq.reference : ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
GRCh38.p13.RefSeq.M.codonTable : Vertebrate_Mitochondrial
GRCh38.p13.RefSeq.MT.codonTable : Vertebrate_Mitochondrial
Files within /var/references/snpEff/GRCh38.p13.RefSeq/
genes.gtf -> /var/references/ncbi/homo_sapiens/GRCh38/13/annotations/full.gtf
sequences.fa -> /var/references/ncbi/homo_sapiens/GRCh38/13/sequences/full.fa
snpEffectPredictor.bin
This .bin file was built using the snpEff build process described in their documentation, using NCBI's RefSeq GTF and FASTA.
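For reference, the build step described above can be sketched roughly as follows. This is a minimal sketch, not the exact command I ran: the helper name `snpeff_build_command` and the default `snpEff.jar` / `snpEff.config` locations are assumptions for illustration, and it only assembles the command after checking that `genes.gtf` and `sequences.fa` are where snpEff expects them under `data.dir`.

```python
from pathlib import Path


def snpeff_build_command(data_dir, genome, config="snpEff.config", jar="snpEff.jar"):
    """Return the argv list for a GTF-based snpEff build.

    Hypothetical helper: it only checks the expected layout
    <data_dir>/<genome>/{genes.gtf,sequences.fa} and assembles the command.
    """
    genome_dir = Path(data_dir) / genome
    missing = [
        name for name in ("genes.gtf", "sequences.fa")
        if not (genome_dir / name).exists()
    ]
    if missing:
        raise FileNotFoundError(f"{genome_dir} is missing: {', '.join(missing)}")
    # -gtf22 tells snpEff that the annotation file is GTF 2.2; -v is verbose output.
    return ["java", "-jar", jar, "build", "-gtf22", "-v", "-c", config, genome]


# Example (paths taken from the config above; run the result with subprocess.run):
# cmd = snpeff_build_command("/var/references/snpEff", "GRCh38.p13.RefSeq")
# print(" ".join(cmd))
```

The genome name passed last has to match the `GRCh38.p13.RefSeq.genome` entry in snpEff.config.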
Should I build the contigs individually? The internal RefSeq dbs seem to be built as individual contigs, not a single one lumped together.
What about protein and regulatory regions? I'm working on those next. Are they required to run snpEff?
Thanks
You're asking a few questions, and I think the answer is: it depends on what you're trying to do, which you haven't explained. I don't think it matters (apart from speed) whether you build a snpEff db for each contig or process them together. I would follow previous examples and process each individually.
If this data is in the gff file then it should allow snpEff to annotate your files. If not, you need to either merge gff files or process these separately.
It seems to be running fine without having built the reference db with separate FASTA files for protein sequence and CDS. The build process can accept those, but I just built the db with the genomic DNA FASTA.
The GFF and GTF I used have CDS data and some protein annotations, but not the translated protein sequences.
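A quick way to confirm what the annotation file actually contains is to count the feature types in column 3 of the GTF. The sketch below is only an illustration: `count_feature_types` is a made-up helper name, and the sample lines are invented records, not taken from the NCBI file.

```python
from collections import Counter


def count_feature_types(gtf_lines):
    """Count values of the feature column (3rd field) in GTF/GFF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:  # a valid GTF/GFF record has 9 tab-separated columns
            counts[fields[2]] += 1
    return counts


# Invented sample records, for illustration only.
sample = [
    "#!genome-build GRCh38.p13",
    'chr1\tRefSeq\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tRefSeq\tCDS\t120\t200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
    'chr1\tRefSeq\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
print(count_feature_types(sample))

# On the real file, something like:
# with open("genes.gtf") as fh:
#     print(count_feature_types(fh))
```

If `CDS` shows up with a non-zero count, the coding coordinates are in the file even though the translated protein sequences are not.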
I'm using it to annotate genomic variant calls with effect predictions, along with some other VCFs such as dbSNP.
I'm just not entirely sure about the inner workings of snpEff and how it calculates the protein effects. My plan is to run with both p7 and my p13 build and compare.
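The p7 vs p13 comparison could be sketched along these lines, assuming both runs write snpEff's standard `ANN` INFO field. The function names `ann_by_variant` and `diff_annotations` are made up for this example, and the sample records are invented. Only the effect, impact, gene name, and feature ID sub-fields are compared.

```python
def ann_by_variant(vcf_lines):
    """Map (chrom, pos, ref, alt) to a set of (effect, impact, gene, feature_id)."""
    result = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF meta and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        annotations = set()
        for entry in fields[7].split(";"):
            if not entry.startswith("ANN="):
                continue
            for ann in entry[4:].split(","):
                parts = ann.split("|")
                if len(parts) >= 7:
                    # ANN sub-fields: 1=annotation, 2=impact, 3=gene name, 6=feature ID
                    annotations.add((parts[1], parts[2], parts[3], parts[6]))
        result[(chrom, int(pos), ref, alt)] = annotations
    return result


def diff_annotations(old, new):
    """Return {variant: (only_in_old, only_in_new)} for shared variants that differ."""
    diffs = {}
    for key in old.keys() & new.keys():
        if old[key] != new[key]:
            diffs[key] = (old[key] - new[key], new[key] - old[key])
    return diffs


# Invented records standing in for the p7 and p13 annotated VCFs.
p7_lines = ["chr1\t100\t.\tA\tG\t.\t.\tANN=G|missense_variant|MODERATE|GENE1|GENE1|transcript|NM_0001.1|protein_coding"]
p13_lines = ["chr1\t100\t.\tA\tG\t.\t.\tANN=G|synonymous_variant|LOW|GENE1|GENE1|transcript|NM_0001.2|protein_coding"]
print(diff_annotations(ann_by_variant(p7_lines), ann_by_variant(p13_lines)))
```

One thing to watch for: RefSeq transcript versions may change between releases, so a feature ID that only differs in its version suffix will show up as a difference here even when the predicted effect is the same.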
I'm also curious whether it would be more efficient to use a db built from individual contigs. Maybe there is some internal multi-threading that can make use of this. I don't know.