Annotation with snpEFF
0
3
Entering edit mode
6.4 years ago
NB ▴ 960

I have been using the following commands to annotate my vcf file with Refseq annotation on SnpEff 4.3t, followed by snpsift to extract columns of interest.

java -Xmx4g -jar snpEff.jar hg19 $panel.sorted.vcf > $panel.snpeff.hg19.vcf

java -jar SnpSift.jar extractFields -s "," -e "." $panel.snpeff.hg19.vcf CHROM POS ID REF ALT "ANN[*].FEATUREID" "ANN[*].HGVS_C" "ANN[*].HGVS_P" > $panel.snpeff.hg19.HGVS.txt

I can see that some feature IDs have been annotated with non-refseq annotations(eg 4PED, 4IDN etc) and the extraction of HGVS_P is not formatted.

Is there a better way to extract and format RefSeq and HGVS annotations ?

Thank you

CHROM   POS ID  REF ALT ANN[*].FEATUREID    ANN[*].HGVS_C   ANN[*].HGVS_P**
1   227172290   rs12593 C   T   4PED:A_480-A_522:NM_020247.4,4PED:A_480-A_523:NM_020247.4,4PED:A_480-A_636:NM_020247.4,4PED:A_480-A_640:NM_020247.4,4PED:A_480-A_643:NM_020247.4,NM_020247.4    c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T .,.,.,.,.,p.Phe480Phe
1   227174210   rs3738725   T   C   4PED:A_572-A_630:NM_020247.4,NM_020247.4,NM_003607.3,NM_014826.4    c.1716T>C,c.1716T>C,c.*7759A>G,c.*7759A>G   .,p.Ser572Ser,.,.
14  51057727    rs1060197   G   A   4IDN:B_77-B_117:NM_015915.4,4IDN:B_77-B_117:NM_001127713.1,4IDN:B_77-B_117:NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3   c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A   .,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
1   112329551   rs3738298   G   T   NM_004980.4,NM_172198.2 c.1269+15C>A,c.1269+15C>A   .,.
1   156785617   rs1800601   G   A   NM_001007792.1,NM_001161441.1,NM_001161443.1,NM_001161442.1,NM_003975.3,NM_001161444.1  c.-5G>A,c.123+181C>T,c.39+181C>T,c.69+181C>T,c.123+181C>T,c.123+181C>T  .,.,.,.,.,.
X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A   .,.,.
annotation vcf snpeff extractFields • 5.7k views
ADD COMMENT
1
Entering edit mode

Hello Nandini,

and the extraction of HGVS_P is not formatted

what do you mean by this?

fin swimmer

ADD REPLY
0
Entering edit mode

Hi fin swimmer,

So if you see the result snippet in the last column, the HGVS_P columns, when extracted looks like this

ANN[*].HGVS_P**
.,.,.,.,.,p.Phe480Phe
.,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
.,.,.,.,.,.
.,.,.
ADD REPLY
0
Entering edit mode

Yes, but this is correct. Without any paramater SnpEff annotates your variants for every transcript it has in its database.

Let's take this line:

X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A   .,.,.

There are 3 transcript. And so you have 3 values for HGVS_C and 3 values HGVS_P each seperated with a comma. If you just want one transcript per variant you have to provide a list of transcript numbers using -onlyTr <file.txt> or let SnpEff choose the canonical transcript with -canon.

fin swimmer

ADD REPLY
0
Entering edit mode

yup, unfortunately, we need to report all transcripts as we are working under a clinical setting.

ADD REPLY
0
Entering edit mode

So if you need all transcripts, what is your question? :) All desired informations are there.

Can you give an example of the output format you like to have?

fin swimmer

ADD REPLY
0
Entering edit mode

My question is 1) Is there a way to avoid non-ref seq annotations such as 4PED:A_480-A_522 4IDN:B_77-B_117

    4PED:A_480-A_522:NM_020247.4,4PED:A_480-A_523:NM_020247.4,4PED:A_480-A_636:NM_020247.4,4PED:A_480-A_640:NM_020247.4,4PED:A_480-A_643:NM_020247.4,NM_020247.4
 4IDN:B_77-B_117:NM_015915.4,4IDN:B_77-B_117:NM_001127713.1,4IDN:B_77-B_117:NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3

2) Is there a better way to extract the columns in a formatted way without the commas.

ANN[*].HGVS_P**
.,.,.,.,.,p.Phe480Phe
.,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
.,.,.,.,.,.
.,.,.

Desired out should be like

CHROM   POS ID  REF ALT ANN[*].FEATUREID    ANN[*].HGVS_C   ANN[*].HGVS_P**
1   227172290   rs12593 C   T   NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4,NM_020247.4 c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T p.Phe480Phe
1   227174210   rs3738725   T   C   NM_020247.4,NM_020247.4,NM_003607.3,NM_014826.4 c.1716T>C,c.1716T>C,c.*7759A>G,c.*7759A>G   p.Ser572Ser
14  51057727    rs1060197   G   A   NM_015915.4,NM_001127713.1,NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3   c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A   p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
1   112329551   rs3738298   G   T   NM_004980.4,NM_172198.2 c.1269+15C>A,c.1269+15C>A   
1   156785617   rs1800601   G   A   NM_001007792.1,NM_001161441.1,NM_001161443.1,NM_001161442.1,NM_003975.3,NM_001161444.1  c.-5G>A,c.123+181C>T,c.39+181C>T,c.69+181C>T,c.123+181C>T,c.123+181C>T  
X   70442845    rs6525485   G   A   NM_000166.5,NR_001568.1,NM_001097642.2  c.-713G>A,n.173-12792C>T,c.-16-697G>A
ADD REPLY
0
Entering edit mode

Could you please provide an unannotated extract of your vcf file, that one can play around with it?

ADD REPLY
0
Entering edit mode

Should I paste it here or upload it on slack as DM ? Its a file with ~20 variants

ADD REPLY
0
Entering edit mode

Upload is fine if you can.

ADD REPLY
0
Entering edit mode

I have uploaded the file on slack

ADD REPLY
1
Entering edit mode

Hello,

I take a look at this. I haven't found an option for snpEff to get rid of those transcripts like 4PED:A_480-A_522. Furthermore I think your desired output is not a good idea. You like to remove empty values for HGVS_P. But doing so you lost the information which value belongs to which HGVS_C and transcript.

I have another suggestion: snpEff provides a script to split the annotation for each variant into one line per transcript. We could then remove those lines where the transcript doesn't start with N (or NM if you also want to remove NR). Your output will than looks like this:

14  78028803    rs2364602   A   G   NM_004863.3 c.786T>C    p.Asn262Asn
14  102508056   rs10129889  C   A   NM_001376.4 c.12087C>A  p.His4029Gln
14  102515015   rs1004903   G   A   NM_001376.4 c.13372+9G>A    .
X   70442845    rs6525485   G   A   NM_000166.5 c.-713G>A   .
X   70442845    rs6525485   G   A   NR_001568.1 n.173-12792C>T  .
X   70442845    rs6525485   G   A   NM_001097642.2  c.-16-697G>A    .

You get this result with

/path/to/snpEff/scripts/vcfEffOnePerLine.pl < input.snpeff.hg19.vcf \ 
| java -jar /path/to/snpEff/SnpSift.jar extractFields -s "," -e "." - CHROM POS ID REF ALT "ANN[*].FEATUREID" "ANN[*].HGVS_C" "ANN[*].HGVS_P" \ 
| awk -v OFS="\t" '$6 ~ "^N" {print}' > out.snpeff.hg19.HGVS.txt

fin swimmer

ADD REPLY
0
Entering edit mode

thanks fin swimmer.... this is helpful... I'll give it a go and see if this can be implemented for us!

ADD REPLY
0
Entering edit mode

Sorry, just adding on to this, would anyone know if snpeff provides the annotation to encoded proteins (NP_) ids ? Currently, it annotates only to RefSeq transcripts (NM_ and NR_)

ADD REPLY
0
Entering edit mode

Hello Nandini,

I'm not aware of such an option. But as NM and NP are bind to each other, you could find out the correspondig NP using BioMart.

fin swimmer

ADD REPLY
0
Entering edit mode

thanks fin swimmer.

It's when I annotate the tool with -hgvsTrId option, it ends up linking the refseq transcript to the protein (HGVS_P). So my results look like this now, which again is technically incorrect - HGVS_P should ideally be linked to a protein NP_ ids

CHROM   POS REF ALT ANN[*].HGVS_C   ANN[*].HGVS_P
5   148386525   T   G   NM_024577.3:c.3594A>C   NM_024577.3:p.Pro1198Pro
5   148407708   A   C   NM_024577.3:c.1587T>G   NM_024577.3:p.Arg529Arg
5   148407893   C   A   NM_024577.3:c.1402G>T   NM_024577.3:p.Ala468Ser
5   148408101   A   G   NM_024577.3:c.1194T>C   NM_024577.3:p.Gly398Gly
ADD REPLY

Login before adding your answer.

Traffic: 2528 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6