Question

Extracting information from gnomAD


23 months ago

Eliza ▴ 40

Hi, I downloaded data from gnomAD for the Y chromosome. The metadata for the vep field of the INFO column, which I'm interested in, looks like this:

##INFO=<ID=vep,Number=.,Type=String,Description=\"Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info\">"

I'm interested in the fields SYMBOL, Gene, SIFT and PolyPhen. For example, I used this code in R to extract the symbol:

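# For each variant, take the first VEP annotation and keep field 4 (SYMBOL)
for (i in seq_along(data_snp$vep)) {
  first_annotation <- strsplit(as.character(data_snp$vep[i]), ",", fixed = TRUE)[[1]][1]
  data_snp$variant_info_gene_name[i] <- strsplit(first_annotation, "|", fixed = TRUE)[[1]][4]
}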

The problem is getting SIFT and PolyPhen, as I don't understand how to find their position (SYMBOL was at position 4 in the string). Here is an example vep value for a SNP on the Y chromosome:

"G|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*21T>C||732||||||1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1|||||||||||||||||||||||||||||||||,G|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene|||||||||||1|2859|1||SNV|1|HGNC|24117|YES|||||||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA|||||||||||1|2115|1||SNV|1|HGNC|48297|YES|||||||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding|||||||||||1|40|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1|||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding|||||||||||1|136|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1|||||||||||||||||||||||||||||||||"
> data_snp$vep[1][3]

How can I extract SIFT and PolyPhen from the data above based on their position, given that there are many empty "|" separators (for example |YES|||||||||||||||||||||||||||||||||||||||||)? Is there a smarter way than just counting the position? I know that in the metadata SIFT and PolyPhen are at positions 36 and 37 in the string. My question is: are they at the same positions in the vep field of a specific SNP like the one above?

vcf info snp gnomad • 982 views

Updated 23 months ago by barslmn ★ 2.3k • written 23 months ago by Eliza ▴ 40

score 0 · Answer 1 · 2023-01-07


23 months ago

Pierre Lindenbaum 164k

how can I extract SIFT and PolyPhen on the data above based on their position as there are many empty

The variant is not in the protein but in the UTR (3_prime_UTR_variant). There is no such score for a non-protein variant:

"PolyPhen-2 (Polymorphism Phenotyping v2) is a tool which predicts possible impact of an amino acid substitution"

"SIFT web server: predicting effects of amino acid substitutions on proteins"
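On the "smarter way than counting" part: one option is to read the field names from the Format list in the header and look fields up by name. VEP writes every field for every annotation, empty or not, so the order should match the header. Below is a minimal sketch in R under that assumption. The header and vep string are shortened toy examples with made-up values, and `vep_fields` and `parse_vep` are hypothetical helper names, not part of any package.

```r
# Toy header and vep string (hypothetical, shortened to six fields).
header <- '##INFO=<ID=vep,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Gene|SIFT|PolyPhen">'
vep <- "G|missense_variant|GENE1|ENSG01|deleterious(0.01)|probably_damaging(0.99),G|3_prime_UTR_variant|GENE1|ENSG01||"

# Field names, in order, taken from the text after "Format: ".
vep_fields <- function(header) {
  fmt <- sub(".*Format: ([A-Za-z0-9_|]+).*", "\\1", header)
  strsplit(fmt, "|", fixed = TRUE)[[1]]
}

# One row per annotation (comma-separated), one named column per field.
parse_vep <- function(vep, fields) {
  annotations <- strsplit(vep, ",", fixed = TRUE)[[1]]
  rows <- lapply(annotations, function(a) {
    values <- strsplit(a, "|", fixed = TRUE)[[1]]
    # strsplit() drops a trailing empty field, so pad back to full length.
    length(values) <- length(fields)
    values[is.na(values)] <- ""
    values
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- fields
  out
}

fields <- vep_fields(header)
parsed <- parse_vep(vep, fields)

# Select by name instead of by counted position.
parsed[, c("SYMBOL", "Gene", "SIFT", "PolyPhen")]
```

With the full gnomAD header, `which(fields == "SIFT")` would give the position directly, and the UTR annotation simply shows empty strings in the SIFT and PolyPhen columns.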



So if I understand correctly, in this case those fields would be empty because this variant is not in the protein?

Reply • 23 months ago by Eliza ▴ 40


Yes, UTR means untranslated region.

Since these tools predict the effect of amino acid changes, you won't see information from them outside of protein-coding sequence.
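In practice this means most annotations will have empty SIFT and PolyPhen fields, so it can help to keep only the rows that carry a prediction. A rough sketch in R, using a toy table with made-up values and assuming the usual VEP "label(score)" layout for SIFT:

```r
# Toy per-annotation table (hypothetical values). In practice it would come
# from splitting each vep string on "," and then on "|".
annotations <- data.frame(
  Consequence = c("missense_variant", "3_prime_UTR_variant", "downstream_gene_variant"),
  SYMBOL      = c("GENE1", "GENE1", "GENE2"),
  SIFT        = c("tolerated(0.32)", "", ""),
  PolyPhen    = c("benign(0.05)", "", ""),
  stringsAsFactors = FALSE
)

# Keep only annotations where SIFT is not empty.
scored <- annotations[nzchar(annotations$SIFT), ]

# Split "label(score)" into a text label and a numeric score.
scored$SIFT_label <- sub("\\(.*$", "", scored$SIFT)
scored$SIFT_score <- as.numeric(sub("^.*\\(([0-9.]+)\\)$", "\\1", scored$SIFT))
scored
```

Only the missense row survives the filter here; the UTR and downstream rows drop out because their SIFT field is empty.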

Reply • 23 months ago by barslmn ★ 2.3k