Hi, I downloaded chromosome Y data from gnomAD. The metadata for the vep field of the INFO column, which is the field I'm interested in, looks like this:
##INFO=<ID=vep,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">
I'm interested in the fields SYMBOL, Gene, SIFT and PolyPhen. For example, I used this R code to extract the symbol:
# Take the 4th "|"-separated field (SYMBOL) of each vep string
for (i in seq_along(data_snp$vep)) {
  data_snp$variant_info_gene_name[i] <- read.delim(text = data_snp$vep[i], sep = "|", header = FALSE)[, 4]
}
The problem comes when I want to get SIFT and PolyPhen, because I don't understand how to find their positions (the position of SYMBOL in the string was 4). Here is an example vep value for an SNP on chromosome Y:
"G|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*21T>C||732||||||1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1|||||||||||||||||||||||||||||||||,G|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene|||||||||||1|2859|1||SNV|1|HGNC|24117|YES|||||||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA|||||||||||1|2115|1||SNV|1|HGNC|48297|YES|||||||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding|||||||||||1|40|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1|||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding|||||||||||1|136|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1|||||||||||||||||||||||||||||||||"
> data_snp$vep[1][3]
How can I extract SIFT and PolyPhen from the data above based on their position, given that there are many empty "|" separators (for example |YES|||||||||||||||||||||||||||||||||||||||||)? Is there a "smarter" way than just counting the position? I know that in the metadata SIFT and PolyPhen are at positions 36 and 37 in the string. My question is: are they at the same positions when looking at the vep field of a specific SNP like the one above?
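VEP writes the fields of every annotation in the order listed in the header, and an empty value still keeps its "|" separator, so the positions do line up with the metadata (SIFT is field 36 and PolyPhen is field 37 here). Rather than counting, one option is to take the field names from the header and look values up by name. The following is a minimal base R sketch of that idea; `vep_format`, `vep_fields`, `parse_vep` and `example_vep` are names made up for this illustration, and the example string is the first two annotations from the SNP above, with the long runs of empty fields written with `strrep()`.

```r
# Field names copied from the "Format:" part of the vep header line.
vep_format <- paste0(
  "Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|",
  "HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|",
  "Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|",
  "SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|",
  "GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|",
  "EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|",
  "ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|",
  "SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|",
  "LoF|LoF_filter|LoF_flags|LoF_info"
)
vep_fields <- strsplit(vep_format, "|", fixed = TRUE)[[1]]

# Hypothetical helper: turn one vep string into a data frame with one row
# per annotation (annotations are separated by ",") and one named column
# per field.
parse_vep <- function(vep_string, fields) {
  annotations <- strsplit(vep_string, ",", fixed = TRUE)[[1]]
  rows <- lapply(annotations, function(ann) {
    # strsplit() drops a trailing empty field, so append a sentinel "|"
    # to keep the last (usually empty) field.
    parts <- strsplit(paste0(ann, "|"), "|", fixed = TRUE)[[1]]
    # Fail loudly if an annotation does not match the header layout.
    stopifnot(length(parts) == length(fields))
    parts
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- fields
  out
}

# First two annotations of the example SNP; strrep("|", n) stands in for
# the runs of empty fields.
example_vep <- paste(
  paste0(
    "G|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|",
    "protein_coding|1/1||ENST00000383070.1:c.*21T>C||732||||||1||-1||SNV|1|HGNC|11311|YES|||",
    "CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1",
    strrep("|", 33)
  ),
  paste0(
    "G|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|",
    "processed_pseudogene", strrep("|", 11),
    "1|2859|1||SNV|1|HGNC|24117|YES", strrep("|", 41)
  ),
  sep = ","
)

parsed <- parse_vep(example_vep, vep_fields)
print(parsed[, c("SYMBOL", "Gene", "SIFT", "PolyPhen")])
```

For a whole column, something like `lapply(data_snp$vep, parse_vep, fields = vep_fields)` would give one data frame per variant. The sketch does not handle missing (`NA`) vep values, and it assumes no field value contains a "," or "|" itself.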
So, if I understand correctly, in this case those fields would be empty because this variant is not in the protein?
Yes, UTR means untranslated region.
SIFT and PolyPhen predict the effect of amino acid substitutions, so you won't see values for them outside of protein-coding sequence; in practice they are only filled in for missense variants.
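When the fields are filled in, VEP normally writes them as a prediction followed by a score in parentheses, for example `deleterious(0.01)`. As a rough sketch, assuming that `prediction(score)` layout, the two parts could be separated as below. `split_prediction` is a hypothetical helper, and the input values are made up for illustration, not taken from the gnomAD file.

```r
# Hypothetical helper: split "prediction(score)" strings into two columns.
# Empty or malformed values give NA in both columns.
split_prediction <- function(x) {
  pattern <- "^(.*)\\(([0-9.]+)\\)$"
  has_score <- grepl(pattern, x)

  prediction <- rep(NA_character_, length(x))
  score <- rep(NA_real_, length(x))

  # Only convert the entries that actually match the expected layout.
  prediction[has_score] <- sub(pattern, "\\1", x[has_score])
  score[has_score] <- as.numeric(sub(pattern, "\\2", x[has_score]))

  data.frame(prediction = prediction, score = score, stringsAsFactors = FALSE)
}

# Made-up example values: one missense-style entry, one empty field
# (as for the UTR variant above), and one more prediction.
sift_values <- c("deleterious(0.01)", "", "tolerated_low_confidence(0.23)")
sift_split <- split_prediction(sift_values)
print(sift_split)
```

The same function should work for PolyPhen values such as `probably_damaging(0.998)`, since they follow the same layout.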