Hi all,
What I need to do is filter a file produced using non-stringent Variant Effect Predictor (VEP) settings with one that was produced with more stringent VEP settings.
I've been running VEP locally in offline mode with a pre-built cache, using this command on my VCFs:
perl $VEP \
--cache \
--dir $VEP_DIR \
--offline \
--input_file $input \
--output_file $output \
--sift b \
--polyphen b \
--regulatory \
--protein \
--symbol \
--ccds \
--uniprot \
--check_existing \
--gmaf \
--maf_1kg \
--maf_esp \
--pubmed
Everything works great and I'm super happy with the documentation. However, I realized after I had run my command on all my exomes that I would most likely get many entries for each particular variant depending on different Ensembl Feature IDs.
VEP has a fix for this, which is to use the --most_severe flag when running the command. That works perfectly; however, some extra flags are disabled when using the --most_severe flag. I would like to retain this extra information (gene name/symbol, Feature, Consequence, etc.) for the variants produced with the --most_severe flag.
This is the command I ran with --most_severe:
perl $VEP \
--cache \
--dir $VEP_DIR \
--offline \
--input_file $input \
--output_file $output \
--regulatory \
--uniprot \
--check_existing \
--gmaf \
--maf_1kg \
--maf_esp \
--most_severe
So now I have two files for each VCF: 1) one produced without --most_severe and 2) one produced with --most_severe. The 2nd file is basically a subset of the 1st file, but with some important information missing.
In the 1st file, when there are multiple entries for a variant, most of the fields are the same except the Feature_type field and often the Extra field.
Both runs produce a tab-delimited text file with columns like these:
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
Is there a way to filter the 1st file with the 2nd file? I think I need to use the fields Uploaded_variation and Consequence for matching against the 1st file, because those are the fields that make a line unique.
I think using awk to match columns exactly between the two files won't work, because some information is lost in the Consequence field of the 2nd file.
For example, a variant's Consequence may change from:
non_coding_transcript_exon_variant,non_coding_transcript_variant
to
non_coding_transcript_exon_variant
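To make the matching idea concrete, here is a rough Python sketch of what I have in mind, not a tested solution. It assumes both files keep the single "#Uploaded_variation ..." header line, that "##" lines are comments, and that the Consequence in the --most_severe file is one of the comma-separated terms in the matching line of the 1st file. The function names and the toy rows are made up for illustration.

```python
def parse_vep(lines):
    """Return (column names, list of row dicts) from VEP tab-delimited lines."""
    header, rows = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and "##" comment lines
        if line.startswith("#"):
            header = line[1:].split("\t")  # the "#Uploaded_variation ..." line
            continue
        rows.append(dict(zip(header, line.split("\t"))))
    return header, rows


def filter_by_most_severe(full_lines, severe_lines):
    """Keep rows of the full output whose variant ID matches the --most_severe
    output and whose Consequence list contains the most severe term."""
    _, severe_rows = parse_vep(severe_lines)
    wanted = {
        (row["Uploaded_variation"], term)
        for row in severe_rows
        for term in row["Consequence"].split(",")
    }
    header, full_rows = parse_vep(full_lines)
    kept = [
        row for row in full_rows
        if any((row["Uploaded_variation"], term) in wanted
               for term in row["Consequence"].split(","))
    ]
    return header, kept


# Toy rows (invented) with a reduced set of columns.
cols = "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\tFeature_type\tConsequence"
full = [
    cols,
    "var1\t1:100\tA\tG1\tT1\tTranscript\t"
    "non_coding_transcript_exon_variant,non_coding_transcript_variant",
    "var1\t1:100\tA\tG1\tT2\tTranscript\tintron_variant",
]
severe = [cols, "var1\t1:100\tA\t-\t-\t-\tnon_coding_transcript_exon_variant"]

header, kept = filter_by_most_severe(full, severe)
for row in kept:
    print("\t".join(row[c] for c in header))
```

With real files, the two list arguments could be replaced by open file handles. One caveat: if several transcripts share the same most severe consequence, this would still keep more than one line per variant.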
I appreciate any help in solving this issue. Alternatively, there is a filter_vep script provided by VEP for post-annotation filtering, but I don't think it has an option that will solve my problem.
Thanks,
Tesa
Emily please help him/her.