Hi,
Does anyone know whether there is a default protein database linked to the MAF files of the TCGA project?
Basically, I want to get the mutated protein sequences that correspond to the missense mutations listed in the MAF files. For that, I need the correct protein sequence for each gene with a missense mutation. Since the MAF file gives the position of the mutation in both the gene and the protein, I can write a simple script that replaces the wild-type amino acid with the mutant one. However, it is crucial to use the correct protein sequence, so that the position and wild-type amino acid stored in the MAF file match the amino acid at that position of the protein sequence in the database.
Thanks
I just noticed you updated your question. To get "mutated protein sequences" for a list of missense mutations, you can use Ensembl's VEP with the ProteinSeqs plugin, as explained in this post - A: Is it possible to get seq data for cancers via database?
Hi, did you work out the problem? Could you tell me which tool you used?
Hi, I used the proteome file that corresponds to the GRCh build I was working with. I wrote a script that checks that the wild-type amino acid at the position of the mutation is the same as the one annotated in the MAF. It worked very well.
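For anyone who wants to try the same idea, here is a minimal sketch of that check-then-substitute step. It is not the actual script described above. The function name `apply_missense` is made up for illustration, and it assumes the protein change is written in short notation such as `p.V600E` (as in the `HGVSp_Short` column of recent MAFs).

```python
import re

# Assumption: the protein change uses one-letter short notation, e.g. "p.V600E".
MISSENSE_PATTERN = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def apply_missense(protein_seq, protein_change):
    """Return the mutated sequence, or raise ValueError if the
    wild-type residue in the sequence does not match the annotation."""
    match = MISSENSE_PATTERN.match(protein_change)
    if match is None:
        raise ValueError(f"Not a simple missense change: {protein_change!r}")

    wild_type = match.group(1)
    position = int(match.group(2))  # 1-based position in the protein
    mutant = match.group(3)

    if position < 1 or position > len(protein_seq):
        raise ValueError(f"Position {position} is outside the sequence")

    # The key sanity check: the annotated wild-type residue must be the
    # residue actually found at that position in the proteome sequence.
    found = protein_seq[position - 1]
    if found != wild_type:
        raise ValueError(
            f"Expected {wild_type} at position {position}, found {found}"
        )

    return protein_seq[: position - 1] + mutant + protein_seq[position:]
```

A mismatch usually means the sequence comes from a different transcript or genome build than the one used to annotate the MAF, so it is safer to skip or log that mutation than to substitute anyway.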
Also there's a new tool called MuPeXI. You only need to input a VCF file and one of the optional outputs is a peptide with the mutation in the middle.
Thank you for your reply. But I still don't know how to download a large set of wild-type protein sequences from the website. Could you also share the script that replaces the wild-type amino acid with the mutation?
You can download them from Ensembl, for example. If you used a different reference genome, you just have to move up to the parent directories and look for the correct genome.
I'm not sure the script would work as it is right now, since it has other functions linked to it. But let me check, and I'll share it as a Jupyter notebook soon.
Thank you for your help. Could you share your script?
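Once the proteome FASTA is downloaded, loading it into a dictionary can look roughly like the sketch below. The helper names `parse_fasta` and `read_proteome` are hypothetical, and it assumes Ensembl-style headers where the first token after `>` is the protein ID with a version suffix (for example `ENSP00000288602.7`), which is stripped here.

```python
import gzip


def parse_fasta(lines):
    """Map protein ID (version suffix removed) to its sequence.

    `lines` is any iterable of text lines in FASTA format.
    """
    proteins = {}
    current_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                proteins[current_id] = "".join(chunks)
            # Assumption: first token is the ID, e.g. "ENSP00000288602.7".
            current_id = line[1:].split()[0].split(".")[0]
            chunks = []
        else:
            chunks.append(line)
    if current_id is not None:
        proteins[current_id] = "".join(chunks)
    return proteins


def read_proteome(path):
    """Read a plain or gzip-compressed FASTA file from disk."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_fasta(handle)
```

Keying on the protein ID rather than the gene name matters because one gene can have several protein isoforms with different lengths.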
Hi,
You can find the code and example files in our GitHub repo here: https://github.com/cansysbio/immunogenomics/tree/neoepitopes
I should emphasise that the code and the example files are for illustration only. If you try to run the code with the example files it won't work, because you need to use the complete proteome data file. Also, some MAFs have different headers and different columns, so you will need to customise the code for the particular MAFs and proteome file you are using.
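To illustrate the point about differing headers, here is a small sketch (not taken from the repository above) of a reader that takes the column names as parameters. The function name `iter_missense` is made up. The defaults `Variant_Classification`, `ENSP` and `HGVSp_Short` are column names used in GDC-style MAFs, so check your own file's header and override them if it uses different ones.

```python
import csv


def iter_missense(lines,
                  class_col="Variant_Classification",
                  protein_col="ENSP",
                  change_col="HGVSp_Short"):
    """Yield (protein_id, protein_change) for each missense row.

    `lines` is any iterable of text lines from a tab-separated MAF.
    Column names are parameters because MAF headers vary between sources.
    """
    # Leading comment lines (e.g. "#version ...") are skipped.
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    for row in reader:
        if row.get(class_col) == "Missense_Mutation":
            yield row[protein_col], row[change_col]
```

With a file on disk you would call it as `iter_missense(open("my.maf"), protein_col="...")`, passing whichever column holds the protein ID in your MAF.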