I am working with the mc3.v0.2.8.PUBLIC.maf.gz MAF file (downloaded from here) and I need to analyze the coding sequences (CDS) of the mutated genes. I only do this for SNP mutations that are silent, missense or nonsense. I also explicitly only consider mutations that have a valid context value (i.e. a string of 11 nucleotides, since it is the mutated base pair +/- 5 nucleotides). The strategy I tried was:
- Use Ensembl's REST API (for GRCh37) to get the CDS sequences of all unique features in the MAF file. All features are transcripts, so the query IDs [object_type] that I give to the API are the Ensembl transcript IDs that come with the MAF file.
- For each mutation in the MAF, get the CDS position (a column in the MAF) of the mutation and get this position +/- 5 nucleotides in the retrieved CDS sequence of the corresponding transcript. These are the "fetched contexts".
- Finally, check whether the fetched contexts are equal to the contexts in the MAF, to verify that the strategy is correct.
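For reference, the steps above can be sketched roughly as follows. This is a minimal sketch, not my exact script: the function names are made up for illustration, it assumes the CDS position is 1-based and may be written as "379/2802" (position/length), and it uses the `sequence/id` endpoint of the GRCh37 Ensembl REST server with `type=cds`.

```python
import urllib.request

ENSEMBL_GRCH37 = "https://grch37.rest.ensembl.org"


def fetch_cds(transcript_id: str) -> str:
    """Download the CDS of one transcript as plain text (needs network access)."""
    url = f"{ENSEMBL_GRCH37}/sequence/id/{transcript_id}?type=cds"
    req = urllib.request.Request(url, headers={"Content-Type": "text/plain"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode().strip()


def parse_cds_position(value: str) -> int:
    """Assumption: the MAF value is either '379' or '379/2802' (position/CDS length)."""
    return int(value.split("/")[0])


def context_from_cds(cds: str, cds_pos: int, flank: int = 5):
    """Return the base at the 1-based cds_pos plus `flank` bases on each side.

    Returns None when the window would run past either end of the CDS.
    """
    i = cds_pos - 1  # convert to 0-based index
    if i - flank < 0 or i + flank >= len(cds):
        return None
    return cds[i - flank : i + flank + 1]


# Example usage (commented out because it calls the remote API):
# cds = fetch_cds("ENST00000227163")
# fetched = context_from_cds(cds, parse_cds_position("379"))
```

Mutations closer than 5 nucleotides to either end of the CDS get no fetched context in this sketch, since the window cannot be filled from the CDS alone.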
Of the 2,861,189 mutations that I am considering, only 1,369,738 have contexts that exactly match the fetched contexts. Since the MAF is based on NCBI's build of GRCh37, I thought the differences might be due to this, so I took an example mutation that was mismatching (TCGA-02-0003-01A-01D-1490-08, ENST00000227163, CDS_Position: 379) and looked up its CDS directly on NCBI. To my surprise, the context from the NCBI CDS matched the context from the Ensembl CDS perfectly (TGTCCCCAGCC), while the MAF's context (GGCTGGGGACA) appears to have nothing to do with either Ensembl or NCBI.
In fact, while I was writing this post, I also noticed that the codon column for the example mutation does not match the context in the MAF! It does match my fetched CDS from Ensembl and NCBI, though. What is going on here?
EDIT: in the example mutation, the reference allele also doesn't match the codon! The ref is G and the codon column says Cca/Aca. I'm so confused.
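One quick check that may be worth running on the example is whether the values only differ by strand. The sketch below uses nothing but the strings quoted above; the helper names are made up for illustration, and reading the single uppercase letter of the reference codon as the mutated base is an assumption about the codon column's notation.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def compare_context(maf_context: str, fetched_context: str) -> str:
    """Classify how the MAF context relates to the context cut from the CDS."""
    if maf_context == fetched_context:
        return "same strand"
    if reverse_complement(maf_context) == fetched_context:
        return "reverse complement"
    return "mismatch"


def codon_ref_base(codons: str) -> str:
    """Assumption: in 'Cca/Aca' the uppercase letter of the first codon is the mutated base."""
    ref_codon = codons.split("/")[0]
    return next(base for base in ref_codon if base.isupper())


print(compare_context("GGCTGGGGACA", "TGTCCCCAGCC"))       # prints "reverse complement"
print(reverse_complement("G") == codon_ref_base("Cca/Aca"))  # prints True
```

For the example mutation, the MAF context is exactly the reverse complement of the fetched CDS context, and the reference allele G is the complement of the C in the codon column. That would be consistent with the MAF reporting the context and the reference allele on the forward genomic strand while the CDS is given in transcript orientation for a gene on the reverse strand, though I have not confirmed this for the other mismatching mutations.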