I'm using VEP to annotate variants in VCF files. In the analysis of the results I want to select only the protein coding transcripts. Should I look at the field "consequence" added by VEP?
I don't have a biological background and I didn't find any answer when searching on the internet until now.
I usually run VEP with the option "--everything". In the results you can see the consequence for each overlapping transcript. You have to parse the results carefully, which is a bit complicated when there are multiallelic variants. There is a field for the transcript biotype named "BIOTYPE" and another field for the transcript itself ("Feature"). Just filter for rows where BIOTYPE is "protein_coding".
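As a rough illustration of the BIOTYPE filter, here is a minimal Python sketch. It assumes tab-delimited VEP output in which BIOTYPE is its own column, "##" lines are metadata, and the column header line starts with a single "#". The function names and the example rows (var1, GENE_A, TX_A, TX_B) are made up for illustration and are not real VEP output.

```python
import io


def read_vep_tab(handle):
    """Yield one dict per data row of a tab-delimited VEP-style file.

    Assumption: '##' lines are metadata and the column header line
    starts with a single '#'.
    """
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata lines
        if line.startswith("#"):
            header = line.lstrip("#").split("\t")
            continue
        if header is None:
            raise ValueError("data row found before the column header line")
        yield dict(zip(header, line.split("\t")))


def protein_coding_rows(handle):
    """Keep only rows whose BIOTYPE column equals 'protein_coding'."""
    return [row for row in read_vep_tab(handle)
            if row.get("BIOTYPE") == "protein_coding"]


# Made-up example input: one variant overlapping two hypothetical transcripts.
EXAMPLE = "\n".join([
    "## made-up metadata line",
    "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\tConsequence\tBIOTYPE",
    "var1\t1:100\tA\tGENE_A\tTX_A\tmissense_variant\tprotein_coding",
    "var1\t1:100\tA\tGENE_A\tTX_B\tnon_coding_transcript_exon_variant\tlncRNA",
])

rows = protein_coding_rows(io.StringIO(EXAMPLE))
for row in rows:
    print(row["Uploaded_variation"], row["Allele"], row["Feature"], row["Consequence"])
```

Each row is one variant allele paired with one transcript, so a multiallelic site appears on several rows that differ in the Allele column; the filter above keeps or drops each of those rows independently.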
Alternatively, you can preload a list of protein-coding transcripts (you can get them from BioMart) and check whether the transcript in the "Feature" field is in your list.
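The list-based approach could look something like the sketch below. It assumes each row has already been parsed into a dict with a "Feature" key, and that your transcript list was exported beforehand (for instance from BioMart) as plain IDs. The IDs used here (TX0001, TX0002, TX0003) are hypothetical, and stripping a ".N" version suffix before comparing is a precaution I am assuming you want, in case one source carries versions and the other does not.

```python
def strip_version(transcript_id):
    """Drop a trailing '.N' version suffix, if present."""
    return transcript_id.split(".", 1)[0]


def keep_listed_transcripts(rows, allowed_ids):
    """Keep rows whose Feature (transcript ID) is in the preloaded list."""
    allowed = {strip_version(t) for t in allowed_ids}  # set gives fast lookups
    return [row for row in rows if strip_version(row["Feature"]) in allowed]


# Hypothetical preloaded list of protein-coding transcript IDs.
protein_coding_ids = ["TX0001", "TX0003.2"]

# Hypothetical parsed rows.
parsed_rows = [
    {"Uploaded_variation": "var1", "Feature": "TX0001.4"},
    {"Uploaded_variation": "var1", "Feature": "TX0002"},
    {"Uploaded_variation": "var2", "Feature": "TX0003"},
]

kept = keep_listed_transcripts(parsed_rows, protein_coding_ids)
print([row["Feature"] for row in kept])
```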
BTW, I would also recommend not using all the protein-coding transcripts, only the more reliable ones (e.g. those with CCDS or APPRIS support).
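A sketch of that stricter filter, again in Python and again on rows already parsed into dicts. It assumes the output has "CCDS" and "APPRIS" columns and that a missing annotation is written as "-", as in tab-delimited VEP output. The function names and all example values are placeholders, not real annotations.

```python
def has_value(row, column):
    """True if the column is present and holds something other than '-' or ''."""
    return row.get(column, "-") not in ("", "-")


def reliable_protein_coding(rows):
    """Keep protein-coding rows that have CCDS or APPRIS support."""
    return [
        row for row in rows
        if row.get("BIOTYPE") == "protein_coding"
        and (has_value(row, "CCDS") or has_value(row, "APPRIS"))
    ]


# Hypothetical parsed rows; the CCDS and APPRIS values are placeholders.
parsed_rows = [
    {"Feature": "TX_A", "BIOTYPE": "protein_coding", "CCDS": "CCDS_X", "APPRIS": "-"},
    {"Feature": "TX_B", "BIOTYPE": "protein_coding", "CCDS": "-", "APPRIS": "-"},
    {"Feature": "TX_C", "BIOTYPE": "protein_coding", "CCDS": "-", "APPRIS": "APPRIS_Y"},
    {"Feature": "TX_D", "BIOTYPE": "lncRNA", "CCDS": "-", "APPRIS": "-"},
]

supported = reliable_protein_coding(parsed_rows)
print([row["Feature"] for row in supported])
```

Whether to require CCDS, APPRIS, or either one is a judgment call; the version above accepts either.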
You can filter your output on the biotype field. Are you using the online tool or the standalone script? In the online tool, there's a small filter box above your results table where you can select Biotype is protein_coding. If you're using the script, you can run the filter script with --filter "Biotype is protein_coding".
Question resolved.
In my output VEP txt file, I don't see a "protein-coding" category. I only see the following:
table(d$BIOTYPE)
However, if I look at which of these variants are found in CCDS or APPRIS, I find the following. Could you advise on how I should then select the protein-coding variants?
table(d$BIOTYPE,d$APPRIS != "-")