extracting information from gnomad
23 months ago
Eliza ▴ 40

Hi, I downloaded Y chromosome data from gnomAD. The metadata for the vep field of the INFO column, which is the field I'm interested in, looks like this:

##INFO=<ID=vep,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">

I'm interested in the fields SYMBOL, Gene, SIFT and PolyPhen. For example, I used this code in R to extract the symbol:

# Take the 4th "|"-separated field (SYMBOL) of the first annotation in each vep string
for (i in seq_along(data_snp$vep)) {
  data_snp$variant_info_gene_name[i] <- read.delim(text = data_snp$vep[i], sep = "|", header = FALSE)[, 4]
}

The problem comes when I want to get SIFT and PolyPhen, because I don't understand how to find their positions (SYMBOL was at position 4 in the string). Here is an example vep string for an SNP on the Y chromosome:

"G|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|protein_coding|1/1||ENST00000383070.1:c.*21T>C||732||||||1||-1||SNV|1|HGNC|11311|YES|||CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1|||||||||||||||||||||||||||||||||,G|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|processed_pseudogene|||||||||||1|2859|1||SNV|1|HGNC|24117|YES|||||||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|RNU6-1334P|ENSG00000251841|Transcript|ENST00000516032|snRNA|||||||||||1|2115|1||SNV|1|HGNC|48297|YES|||||||||||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000525526|protein_coding|||||||||||1|40|-1||SNV|1|HGNC|11311|||||ENSP00000437575||F5H6J8|UPI0002064E1A|1|||||||||||||||||||||||||||||||||,G|downstream_gene_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000534739|protein_coding|||||||||||1|136|-1||SNV|1|HGNC|11311|||||ENSP00000438917||F5H3H1|UPI0002064E1B|1|||||||||||||||||||||||||||||||||"
> data_snp$vep[1][3]

How can I extract SIFT and PolyPhen from the data above based on their position, given that there are many empty "|" separators (for example |YES|||||||||||||||||||||||||||||||||||||||||)? Is there a smarter way than just counting the position? I know that in the metadata SIFT and PolyPhen are at positions 36 and 37 in the string. My question is: are they at the same positions in the vep field of a specific SNP like the one above?

vcf info snp gnomad • 982 views
23 months ago

how can I extract SIFT and PolyPhen on the data above based on their position as there are many empty

The variant is not in the protein but in the UTR (3_prime_UTR_variant). There is no such score for non-protein variants:

"PolyPhen-2 (Polymorphism Phenotyping v2) is a tool which predicts possible impact of an amino acid substitution"

"SIFT" web server: predicting effects of amino acid substitutions on proteins
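
On the positional part of the question: as far as I can tell, every comma-separated annotation in the vep field carries the same number of "|"-separated fields, in the order listed after "Format:" in the header, so SIFT and PolyPhen stay at positions 36 and 37 even when they are empty. Rather than counting, the names can be taken from the header. Below is a minimal sketch in base R. parse_vep is a hypothetical helper name, not something from a package, and the example string is the first two annotations from the question, with the long runs of empty fields written via strrep() to keep the lines readable.

```r
# Field names copied from the "Format:" part of the vep ##INFO header line.
vep_format <- "Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|GMAF|AFR_MAF|AMR_MAF|EAS_MAF|EUR_MAF|SAS_MAF|AA_MAF|EA_MAF|ExAC_MAF|ExAC_Adj_MAF|ExAC_AFR_MAF|ExAC_AMR_MAF|ExAC_EAS_MAF|ExAC_FIN_MAF|ExAC_NFE_MAF|ExAC_OTH_MAF|ExAC_SAS_MAF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info"
vep_fields <- strsplit(vep_format, "|", fixed = TRUE)[[1]]

# Positions are looked up from the header instead of being counted by hand.
wanted <- c("SYMBOL", "Gene", "SIFT", "PolyPhen")
positions <- match(wanted, vep_fields)  # 4 5 36 37

# Hypothetical helper: one vep string -> data frame with one row per annotation.
parse_vep <- function(vep, fields) {
  annotations <- strsplit(vep, ",", fixed = TRUE)[[1]]
  rows <- lapply(annotations, function(ann) {
    # strsplit() drops a trailing empty field, so append one separator first.
    values <- strsplit(paste0(ann, "|"), "|", fixed = TRUE)[[1]]
    # Fail loudly if an annotation does not match the header layout.
    stopifnot(length(values) == length(fields))
    setNames(values, fields)
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# First two annotations of the example SNP from the question.
vep_example <- paste0(
  "G|3_prime_UTR_variant|MODIFIER|SRY|ENSG00000184895|Transcript|ENST00000383070|",
  "protein_coding|1/1||ENST00000383070.1:c.*21T>C||732||||||1||-1||SNV|1|HGNC|11311|YES|||",
  "CCDS14772.1|ENSP00000372547|Q05066|Q6J4J1&A7WPU8|UPI0000135F78|1", strrep("|", 33),
  ",",
  "G|upstream_gene_variant|MODIFIER|RNASEH2CP1|ENSG00000237659|Transcript|ENST00000454281|",
  "processed_pseudogene", strrep("|", 11), "1|2859|1||SNV|1|HGNC|24117|YES", strrep("|", 41)
)

parsed <- parse_vep(vep_example, vep_fields)
print(parsed[, wanted])
```

For the whole column, something like lapply(data_snp$vep, parse_vep, fields = vep_fields) should give one data frame per variant. SIFT and PolyPhen come back as empty strings for these two annotations, which matches the point above about non-protein variants.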

So if I understand correctly, in this case those fields would be empty because this variant is not in the protein?

Yes, UTR means untranslated region.

Since these tools make predictions about amino acid changes, you won't see information from them outside of protein-coding sequence.
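
For variants that do carry a prediction, VEP output as I understand it usually writes SIFT and PolyPhen as a label followed by a score in parentheses. A small sketch of splitting such a value in base R follows. split_prediction is a hypothetical helper name, and the input values are made up for illustration; they are not taken from the Y chromosome data in the question.

```r
# Hypothetical helper: split "label(score)" strings into two columns.
# Empty or otherwise unparsable entries become NA in both columns.
split_prediction <- function(x) {
  has_score <- grepl("^[^()]+\\([0-9.eE+-]+\\)$", x)
  label <- ifelse(has_score, sub("\\(.*$", "", x), NA_character_)
  score <- rep(NA_real_, length(x))
  score[has_score] <- as.numeric(sub("^.*\\((.*)\\)$", "\\1", x[has_score]))
  data.frame(label = label, score = score, stringsAsFactors = FALSE)
}

# Made-up example values; the empty string stands for a non-coding annotation.
predictions <- split_prediction(c("deleterious(0.01)", "", "probably_damaging(0.998)"))
print(predictions)
```

Keeping the empty entries as NA rather than dropping them means the result stays aligned row by row with the annotations it came from.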
