Entering edit mode
6.4 years ago
NB
▴
960
I have been using the following commands to annotate my vcf file with Refseq annotation on SnpEff 4.3t, followed by snpsift to extract columns of interest.
java -Xmx4g -jar snpEff.jar hg19 $panel.sorted.vcf > $panel.snpeff.hg19.vcf
java -jar SnpSift.jar extractFields -s "," -e "." $panel.snpeff.hg19.vcf CHROM POS ID REF ALT "ANN[*].FEATUREID" "ANN[*].HGVS_C" "ANN[*].HGVS_P" > $panel.snpeff.hg19.HGVS.txt
I can see that some feature IDs have been annotated with non-refseq annotations(eg 4PED, 4IDN etc) and the extraction of HGVS_P is not formatted.
Is there a better way to extract and format RefSeq and HGVS annotations ?
Thank you
CHROM POS ID REF ALT ANN[*].FEATUREID ANN[*].HGVS_C ANN[*].HGVS_P**
1 227172290 rs12593 C T 4PED:A_480-A_522:NM_020247.4,4PED:A_480-A_523:NM_020247.4,4PED:A_480-A_636:NM_020247.4,4PED:A_480-A_640:NM_020247.4,4PED:A_480-A_643:NM_020247.4,NM_020247.4 c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T,c.1440C>T .,.,.,.,.,p.Phe480Phe
1 227174210 rs3738725 T C 4PED:A_572-A_630:NM_020247.4,NM_020247.4,NM_003607.3,NM_014826.4 c.1716T>C,c.1716T>C,c.*7759A>G,c.*7759A>G .,p.Ser572Ser,.,.
14 51057727 rs1060197 G A 4IDN:B_77-B_117:NM_015915.4,4IDN:B_77-B_117:NM_001127713.1,4IDN:B_77-B_117:NM_181598.3,NM_015915.4,NM_001127713.1,NM_181598.3 c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A,c.351G>A .,.,.,p.Glu117Glu,p.Glu117Glu,p.Glu117Glu
1 112329551 rs3738298 G T NM_004980.4,NM_172198.2 c.1269+15C>A,c.1269+15C>A .,.
1 156785617 rs1800601 G A NM_001007792.1,NM_001161441.1,NM_001161443.1,NM_001161442.1,NM_003975.3,NM_001161444.1 c.-5G>A,c.123+181C>T,c.39+181C>T,c.69+181C>T,c.123+181C>T,c.123+181C>T .,.,.,.,.,.
X 70442845 rs6525485 G A NM_000166.5,NR_001568.1,NM_001097642.2 c.-713G>A,n.173-12792C>T,c.-16-697G>A .,.,.
Hello Nandini,
what do you mean by this?
fin swimmer
Hi fin swimmer,
So if you see the result snippet in the last column, the HGVS_P columns, when extracted looks like this
Yes, but this is correct. Without any paramater SnpEff annotates your variants for every transcript it has in its database.
Let's take this line:
There are 3 transcript. And so you have 3 values for HGVS_C and 3 values HGVS_P each seperated with a comma. If you just want one transcript per variant you have to provide a list of transcript numbers using
-onlyTr <file.txt>
or let SnpEff choose the canonical transcript with-canon
.fin swimmer
yup, unfortunately, we need to report all transcripts as we are working under a clinical setting.
So if you need all transcripts, what is your question? :) All desired informations are there.
Can you give an example of the output format you like to have?
fin swimmer
My question is 1) Is there a way to avoid non-ref seq annotations such as
4PED:A_480-A_522
4IDN:B_77-B_117
2) Is there a better way to extract the columns in a formatted way without the commas.
Desired out should be like
Could you please provide an unannotated extract of your vcf file, that one can play around with it?
Should I paste it here or upload it on slack as DM ? Its a file with ~20 variants
Upload is fine if you can.
I have uploaded the file on slack
Hello,
I take a look at this. I haven't found an option for
snpEff
to get rid of those transcripts like4PED:A_480-A_522
. Furthermore I think your desired output is not a good idea. You like to remove empty values forHGVS_P
. But doing so you lost the information which value belongs to whichHGVS_C
and transcript.I have another suggestion:
snpEff
provides a script to split the annotation for each variant into one line per transcript. We could then remove those lines where the transcript doesn't start withN
(or NM if you also want to removeNR
). Your output will than looks like this:You get this result with
fin swimmer
thanks fin swimmer.... this is helpful... I'll give it a go and see if this can be implemented for us!
Sorry, just adding on to this, would anyone know if snpeff provides the annotation to encoded proteins (NP_) ids ? Currently, it annotates only to RefSeq transcripts (NM_ and NR_)
Hello Nandini,
I'm not aware of such an option. But as NM and NP are bind to each other, you could find out the correspondig NP using BioMart.
fin swimmer
thanks fin swimmer.
It's when I annotate the tool with
-hgvsTrId
option, it ends up linking the refseq transcript to the protein (HGVS_P). So my results look like this now, which again is technically incorrect - HGVS_P should ideally be linked to a protein NP_ ids