Where do these snpEff annotations come from?
11 months ago
curious ▴ 810

I am annotating a VCF with snpEff, and I eventually want to parse the output for predicted loss-of-function variants.

I want to understand the annotations better and document how they are produced.

I run this command:

snpEff "hg38" -lof {input}

From what I read in the docs, hg38 is:

hg38: UCSC genome with RefSeq transcripts mapped to GRCh38/hg38 reference genome sequence

When I run snpEff databases | grep "hg38" I get:

hg38     Homo_sapiens (UCSC)    OK [https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_hg38.zip, https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_hg38.zip]

which further suggests that this database is UCSC-based.

I think when I run snpEff it is calling "hg38" from here: ~/miniconda3/envs/share/snpeff-5.1-2/data/hg38, which contains these files:

cytoBand.txt.gz                 sequence.15_KI270905v1_alt.bin  sequence.1.bin                 sequence.5_KI270897v1_alt.bin  sequence.7.bin
pwms.bin                        sequence.16.bin                 sequence.20.bin                sequence.6.bin                 sequence.8.bin
sequence.10.bin                 sequence.16_KI270853v1_alt.bin  sequence.21.bin                sequence.6_GL000250v2_alt.bin  sequence.9.bin
sequence.11.bin                 sequence.17.bin                 sequence.22.bin                sequence.6_GL000251v2_alt.bin  sequence.bin
sequence.12.bin                 sequence.17_GL000258v2_alt.bin  sequence.2.bin                 sequence.6_GL000252v2_alt.bin  sequence.X.bin
sequence.13.bin                 sequence.17_KI270857v1_alt.bin  sequence.3.bin                 sequence.6_GL000253v2_alt.bin  sequence.Y.bin
sequence.14.bin                 sequence.17_KI270908v1_alt.bin  sequence.4.bin                 sequence.6_GL000254v2_alt.bin  snpEffectPredictor.bin
sequence.14_KI270847v1_alt.bin  sequence.18.bin                 sequence.5.bin                 sequence.6_GL000255v2_alt.bin
sequence.15.bin                 sequence.19.bin                 sequence.5_GL339449v2_alt.bin  sequence.6_GL000256v2_alt.bin

These are mostly binary files, so I can't really tell what is in them.
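One way to check which data directory and genome entry snpEff is configured to use is to look in its snpEff.config file. This is a minimal sketch, assuming the config uses the usual `data.dir` and `<genome>.genome` keys. The function name is hypothetical, and the path to the config file depends on the installation, so it is passed as an argument:

```shell
# Hypothetical helper: print the data.dir setting and the description line
# for one genome from a snpEff config file.
# Usage: show_snpeff_genome_config <path/to/snpEff.config> <genome>
show_snpeff_genome_config() {
    config_file=$1
    genome=$2
    # data.dir: the directory where snpEff looks for its databases
    grep -E '^[[:space:]]*data\.dir[[:space:]]*[=:]' "$config_file"
    # <genome>.genome: the human-readable description of that database
    grep -E "^[[:space:]]*${genome}\.genome[[:space:]]*[=:]" "$config_file"
}
```

Called as `show_snpeff_genome_config /path/to/snpEff.config hg38`, it only reads the text config. It does not open any of the .bin files.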

I have a few questions:

  1. Is any of my understanding above off?
  2. How can I tell which version of RefSeq is being used? Wouldn't the annotations be updated over time as new splice sites etc. are discovered?
  3. Is RefSeq even desirable for predicting loss of function, or are there other annotation sets (e.g. MANE, GENCODE, Ensembl) that the field is adopting? It seems like MANE might be the best to work with if the cost of being wrong is high, since it reconciles annotations from RefSeq and Ensembl, if I understand correctly.
11 months ago

In principle you can run

snpeff dump hg38 

which generates a text dump of the database contents. In practice, when I run the above I get:

snpeff dump hg38 | more       
java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
        at org.snpeff.interval.Intron.createSpliceSiteDonor(Intron.java:104)
        at org.snpeff.interval.Transcript.createSpliceSites(Transcript.java:713)
        at org.snpeff.interval.Genes.createSpliceSites(Genes.java:129)

... LOL ... we can't even dump the database without running out of memory. Oh well, let's bump up snpEff's Java heap then:

snpeff -Xmx4g dump hg38 | more

Among the information in the dump are versioned accession IDs like NR_024540.1.

So in the end it is not the RefSeq release that matters but the version of each individual record, which is the number after the dot in its accession (here .1).
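To list every versioned accession in the dump, something like the following could work. It is a minimal sketch, assuming the dump is plain text and that the transcripts use the common RefSeq prefixes NM_, NR_, XM_ and XR_. The function name is hypothetical:

```shell
# Hypothetical helper: read text on stdin and print the unique
# RefSeq-style accession.version IDs it contains (e.g. NR_024540.1).
extract_refseq_accessions() {
    # -o prints only the matching part of each line; -E enables extended regex.
    # LC_ALL=C makes the sort order byte-wise and reproducible.
    grep -oE '(NM|NR|XM|XR)_[0-9]+\.[0-9]+' | LC_ALL=C sort -u
}
```

For example, `snpeff -Xmx4g dump hg38 | extract_refseq_accessions | head` would show the first few IDs. Comparing the version suffixes against current RefSeq records indicates how old the annotation of each transcript is.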
