Question

Protein database for TCGA MAF files

3

9.5 years ago

Alejandro Jimenez Sanchez ▴ 180

Hi,

I was wondering if anyone knows if there is a default protein database linked to the MAF files of the TCGA project?

Basically, I want to get the mutated protein sequences that correspond to the missense mutations listed in the MAF files. For that, I need the correct protein sequence for each gene with a missense mutation. Since the MAF file gives the position of the mutation in both the gene and the protein, I can write a simple script that replaces the wild-type amino acid with the mutant one. However, it is crucial to use the correct protein sequence, so that the position and wild-type amino acid stored in the MAF file match the amino acid at that position of the protein sequence in the database.

Thanks

MAF TCGA • 4.3k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by Alejandro Jimenez Sanchez ▴ 180

0

I just noticed you updated your question. To get "mutated protein sequences" for a list of missense mutations, you can use Ensembl's VEP with the ProteinSeqs plugin, as explained in this post - A: Is it possible to get seq data for cancers via database?

ADD REPLY • link 8.6 years ago by Cyriac Kandoth 6.1k

0

Hi, did you work out the problem? Could you tell me which tool you used?

ADD REPLY • link 7.6 years ago by xue.xu • 0

0

Hi, I used the proteome file that corresponds to the GRCh build that I used. I wrote a script that checks that the wild type amino acid in the position of the mutation is the same as the one annotated in the MAF. It worked very well.
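
A minimal sketch of that kind of check, not the actual script: it assumes the protein change is written in the short form "p.V600E" (or "V600E"), and the function name apply_missense is hypothetical. It returns None when the wild-type residue in the proteome does not match the MAF annotation, which usually points to a different isoform or database release.

```python
import re

# Assumed input format: short protein changes such as "p.V600E" or "V600E".
MISSENSE_RE = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z])$")


def apply_missense(protein_seq, protein_change):
    """Return the mutated sequence, or None if the reference residue disagrees."""
    match = MISSENSE_RE.match(protein_change)
    if match is None:
        raise ValueError(f"not a simple missense change: {protein_change!r}")
    wild_type, mutant = match.group(1), match.group(3)
    index = int(match.group(2)) - 1  # protein positions in a MAF are 1-based
    if not 0 <= index < len(protein_seq):
        return None  # position falls outside this protein sequence
    if protein_seq[index] != wild_type:
        return None  # residue mismatch: probably the wrong isoform or release
    return protein_seq[:index] + mutant + protein_seq[index + 1:]
```

For example, apply_missense("MKVLA", "p.V3E") gives "MKELA", while apply_missense("MKVLA", "p.A3E") gives None because position 3 is V, not A.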

Also there's a new tool called MuPeXI. You only need to input a VCF file and one of the optional outputs is a peptide with the mutation in the middle.
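
To illustrate what "a peptide with the mutation in the middle" means, here is a rough sketch. It is not MuPeXI's code or algorithm; the function name and the default flank of 10 residues are assumptions chosen for illustration.

```python
def mutation_centered_peptide(mutated_seq, position, flank=10):
    """Return up to `flank` residues on each side of the 1-based `position`."""
    index = position - 1
    if not 0 <= index < len(mutated_seq):
        raise ValueError("position outside the protein sequence")
    # Clamp the left edge at 0; slicing clamps the right edge by itself.
    start = max(0, index - flank)
    return mutated_seq[start:index + flank + 1]
```

Near either end of the protein the peptide is simply shorter on that side, so the mutated residue is only centred when enough flanking sequence exists.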

ADD REPLY • link 7.6 years ago by Alejandro Jimenez Sanchez ▴ 180

0

Thank you for your reply. But I still don't know how to download a large set of wild-type protein sequences from the website. Could you also share the script that changes the wild-type protein sequence to include the mutation?

ADD REPLY • link 7.6 years ago by xue.xu • 0

0

You can download them from Ensembl, for example. If you used a different reference genome, just move up to the parent directories and look for the correct genome.
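
Once a peptide FASTA file is downloaded, it can be loaded into a dictionary for lookups. The sketch below assumes Ensembl-style headers that carry a "transcript:ENST..." field, and that keying by unversioned transcript ID is what you want; read_proteome and parse_transcript_id are hypothetical names, and the header layout should be checked against your own file.

```python
import gzip


def parse_transcript_id(header):
    """Pull the 'transcript:ENST...' field out of a header, without its version."""
    fields = header[1:].split()
    for field in fields:
        if field.startswith("transcript:"):
            return field.split(":", 1)[1].split(".")[0]
    # Assumption: with no transcript field, fall back to the first token.
    return fields[0].split(".")[0]


def read_proteome(path):
    """Map transcript ID -> protein sequence from a (possibly gzipped) FASTA."""
    opener = gzip.open if str(path).endswith(".gz") else open
    proteins = {}
    current_id = None
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current_id is not None:
                    proteins[current_id] = "".join(chunks)
                current_id = parse_transcript_id(line)
                chunks = []
            elif line:
                chunks.append(line)
    if current_id is not None:
        proteins[current_id] = "".join(chunks)
    return proteins
```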

I'm not sure the script would work as it is right now, since it depends on other functions. But let me check and I'll share it as a Jupyter notebook soon.

ADD REPLY • link 7.6 years ago by Alejandro Jimenez Sanchez ▴ 180

0

Thank you for your help, and would you like to share your script?

ADD REPLY • link 7.6 years ago by xue.xu • 0

0

Hi,

You can find the code and example files in our GitHub repo here: https://github.com/cansysbio/immunogenomics/tree/neoepitopes

I should emphasise that the code and the example files are for illustration only. If you try to run the code with the example files, it won't work, because you need the complete proteome data file. Also, some MAFs have different headers and different columns, so you would need to customise the code for the particular MAFs and proteome file you are using.
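
One way to cope with differing MAF columns is sketched below. This is not the code from the repository: the candidate column names (HGVSp_Short, Protein_Change, amino_acid_change) are assumptions about common MAF flavours and should be extended for the files at hand, and iter_missense is a hypothetical name.

```python
import csv

# Assumed candidates for the protein-change column; adjust for your MAFs.
PROTEIN_CHANGE_COLUMNS = ("HGVSp_Short", "Protein_Change", "amino_acid_change")


def iter_missense(maf_lines):
    """Yield (gene, protein_change) for each missense row of a MAF."""
    # MAFs may begin with '#' comment lines (e.g. '#version 2.4'); skip them.
    rows = (line for line in maf_lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    fieldnames = reader.fieldnames or []
    column = next((c for c in PROTEIN_CHANGE_COLUMNS if c in fieldnames), None)
    if column is None:
        raise KeyError("no known protein-change column in this MAF")
    for row in reader:
        if row.get("Variant_Classification") == "Missense_Mutation":
            yield row["Hugo_Symbol"], row[column]
```

It accepts any iterable of lines, so an open file handle can be passed directly, for example iter_missense(open("my.maf")).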

ADD REPLY • link 7.6 years ago by Alejandro Jimenez Sanchez ▴ 180

Ram · Accepted Answer · 2015-06-15

3

9.5 years ago

Cyriac Kandoth 6.1k

The latest TCGA MAF standard is to choose the "worst affected" isoform per variant, from among the Gencode Basic v19 isoforms. These standards have changed over time, so not all MAFs will use the same isoform database. You'll find GAF files listed here, originally based on UCSC KnownGenes, but now based on Gencode Basic v19.

If you want to map each variant to Uniprot's canonical isoform per gene, then pull the Ensembl ENST IDs of all Uniprot's canonical isoforms, dump them into a text file one ID per line, and pass it to maf2maf under argument --custom-enst, to re-annotate all TCGA MAFs.
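
The "one ID per line" file can be produced with a few lines of Python. This sketch assumes you already have the ENST IDs from some UniProt-to-Ensembl mapping; write_enst_list is a hypothetical helper, and stripping version suffixes such as ".6" is an assumption about what the downstream tool expects.

```python
import re

# Ensembl transcript IDs are "ENST" followed by 11 digits.
ENST_RE = re.compile(r"^ENST\d{11}$")


def write_enst_list(transcript_ids, path):
    """Write unique, unversioned ENST IDs to `path`, one per line."""
    seen = set()
    with open(path, "w") as handle:
        for raw in transcript_ids:
            enst = raw.strip().split(".")[0]  # drop any version suffix
            if not ENST_RE.match(enst) or enst in seen:
                continue  # skip malformed entries and duplicates
            seen.add(enst)
            handle.write(enst + "\n")
    return len(seen)
```

The resulting text file is what would be passed to maf2maf under --custom-enst.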

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.5 years ago by Cyriac Kandoth 6.1k

0

Thanks for your answer. However, now I have a couple of new questions.

The MAFs I am using have the genome identifier 37 under the NCBI_Build column. Therefore, I was wondering if I could directly use the Ensembl GRCh37 protein database from here?

If so, there are several releases for GRCh37. Is it fine if I use the latest one (Ensembl 75: Feb 2014)?

ADD REPLY • link 9.5 years ago by Alejandro Jimenez Sanchez ▴ 180

0

Ensembl's v75 isoform DB is equivalent to Gencode v19, but Gencode Basic v19 is a well-curated subset of isoforms. So you are kinda right. It really depends on what your overall goal is. Why don't you edit your original question to explain your project's overall goal? Also, be careful not to confuse the human reference sequence GRCh37 with the isoform mapping layer on top of it.

ADD REPLY • link 9.5 years ago by Cyriac Kandoth 6.1k

0

Thanks. OK, I'll edit the original question to make it clearer.

ADD REPLY • link updated 23 months ago by Ram 44k • written 9.5 years ago by Alejandro Jimenez Sanchez ▴ 180