We would like to do some analyses finding variants in tumor/normal pairs from TCGA. However, I think I am doing something wrong as sometimes the coordinates are not quite accurate. Would appreciate any suggestions.
- First we downloaded reference genomes from https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files
- Ran sarek to determine variants and pointing to the TCGA genome GRCh38.d1.vd1.fa.tar.gz (fetched from the website in (1) )
- Built a database for SnpEff using SnpEff built using the gencode annotation from TCGA gencode.v36.annotation.gtf.gz and the same genome as (2)
- Then we ran SnpEff and annotated with the gene information from what was built in (3)
However, if I try to fetch a sequence, say ENST00000518579.5, using biomart:
genome=useEnsembl(biomart="ensembl",version=102,dataset = "hsapiens_gene_ensembl")
referenceProteinSequence <- biomaRt::getBM(attributes= c("peptide", "hgnc_symbol","protein_id"),
filters=”ensembl_transcript_id",
values = “ENST00000518579”,
mart=genome)
The variant that is found in Sarek says that the reference is R167, but that amino acid in the reference protein sequence from BioMart is a T:
strsplit(referenceProteinSequence$peptide,"")[[1]][167]
“T”
This does not happen all the time, but sometimes. Do you at all have any kind of tip as to how I can figure out what is happening? I think I am probably pointing to the wrong genome version in biomart? Any help appreciated.