Reannotating a gene: Identifying translation and transcription start sites
7.9 years ago
rh5118 ▴ 40

Hi all -

I'm working on a particular gene in Plasmodium falciparum that I have recently shown is misannotated in the reference genome. The current annotation shows that the gene has three exons; however, our recent data suggest that "exon 1" is not transcribed and spliced with exons 2 and/or 3. Therefore, we believe that the actual gene consists only of exons 2 and 3. I'm wondering how we can identify the new start codon. At my disposal I have several RNA-seq datasets, the ability to do RT-PCR and sequencing, and most common molecular biology experiments to probe this. I may be able to get a proteome dataset from a collaborator if needed. Is there a way to identify the start codon based on read coverage around a particular methionine, for example if there is a read gap around a Met (i.e. a density of reads that start at the methionine and no reads that contain sequence 5' of the methionine)? Should I look for peptides that start with methionine in a proteomic dataset? Should I design primers to try to amplify the transcript and sequence in reverse to find the transcription start site? Help me please! :)

RNA-Seq genome gene • 2.7k views

With RACE-PCR you could determine a more exact transcript to validate your hypotheses about the missing exon.


Thanks for this suggestion. Do you know if this technique is limited by transcript size? While I can make a guess at which Met is the start codon, I could certainly be wrong, and the possible transcript length for 5' RACE-PCR could then be much longer. My thinking is that I can design it to give a nice 300-1000 bp product (if my theories are correct), but if I'm wrong, I could be trying to amplify a very large fragment, which could be difficult.


Honestly, this leads the question far away from bioinformatics, and wet-lab work is not my core competence ;) I just know that our technicians do this on a regular basis on L. salmonis transcripts, and I have never heard about a limitation in transcript size; we recently got a much longer transcript validated. Normally, we obtain a 5' and a 3' RACE product, plus a consensus sequence from normal PCR.

Problems are more likely to come from alternative transcripts or recently duplicated genes.

Here is the Materials and Methods part of our paper, which describes the setup and names the relevant kits:

PCR on cDNA templates was performed using GoTaq Flexi DNA Polymerase (Promega). cDNA was synthesised from total RNA using the qScript cDNA Synthesis Kit (Quanta Biosciences). Rapid amplification of cDNA ends (RACE) was carried out using the SMARTer RACE cDNA Amplification Kit (Clontech). PCR templates for in situ probe and RNAi fragment synthesis were made using the primers listed in Table 1. PCR and RACE products were sequenced at the University of Bergen's sequencing facility using BigDye Terminator v3.1 reagents (Applied Biosystems). The sequences of LsIRP1A and LsIRP1B have been submitted to GenBank with the following accession numbers: LsIRP1A: KP057804; LsIRP1B: KP057805.


I think OP's main interest is in the translation start site (given the talk about the Met/start codon), which, if I'm not mistaken, will not be found using RACE, because RACE finds the transcription start site.


I got that point, but without a ribo-seq or proteomics approach there is afaik nothing better than having the correct transcript. We normally get the validated transcript and take the longest ORF. At least nobody has complained about that ;)
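
For what it's worth, the "take the longest ORF" step can be sketched in a few lines of Python. This is only a minimal illustration, assuming the validated transcript is available as a plain 5'-to-3' sense-strand string; the function name `longest_orf` is made up for this example. It reports the ORF starting at the first ATG of the longest ATG-to-stop stretch, so internal in-frame Mets remain alternative start candidates.

```python
# Sketch: longest ATG-to-stop ORF on the sense strand of a transcript.
# Assumption: the transcript is given 5'->3' as DNA (or RNA) letters.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(transcript):
    """Return (start, end, orf) for the longest ORF, or None if there is none.

    Coordinates are 0-based, end exclusive; the stop codon is included.
    """
    seq = transcript.upper().replace("U", "T")
    best = None
    for frame in range(3):
        start = None  # first ATG seen since the previous in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3, seq[start:i + 3])
                start = None  # look for the next ORF in this frame
    return best

print(longest_orf("CCATGAAATTTGGGTAACC"))
```

ORFs that run off the 3' end without a stop codon are ignored here, which is one reason a full-length (RACE-validated) transcript matters.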


In an oversimplified world, you could do a simple western blot and see if the mass of your protein matches the theoretical mass based on each potential start codon. That being said, I've never done (and never will do) a western blot. This will also crucially depend on the availability of an antibody.
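
To make the comparison concrete, here is a rough Python sketch that lists every in-frame ATG in an ORF together with an approximate protein mass. It assumes the common rule of thumb of about 110 Da per amino acid, which is only an estimate (real proteins often migrate differently on a gel); `candidate_start_masses` and `tag_residues` are hypothetical names for this example.

```python
# Sketch: approximate protein mass for each candidate in-frame start codon.
AVG_RESIDUE_DA = 110.0  # rule-of-thumb average residue mass (assumption)

def candidate_start_masses(orf, tag_residues=0):
    """Return a list of (offset_nt, n_residues, approx_kDa) per in-frame ATG.

    `orf` must start in frame and end with a stop codon.
    `tag_residues` adds the length of any epitope tag to every candidate.
    """
    seq = orf.upper()
    n_codons = len(seq) // 3 - 1  # exclude the stop codon
    results = []
    for i in range(0, n_codons * 3, 3):
        if seq[i:i + 3] == "ATG":
            residues = n_codons - i // 3 + tag_residues
            kda = round(residues * AVG_RESIDUE_DA / 1000, 2)
            results.append((i, residues, kda))
    return results

for offset, residues, kda in candidate_start_masses("ATGAAAATGTTTTAA"):
    print(offset, residues, kda)
```

If two candidate Mets are close together, the predicted masses will differ by too little to resolve on a western blot, so this only discriminates between well-separated candidates.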


Both really - as Michael points out (and in his comment below), it's probably technically easier to do RACE-PCR to figure out the full-length transcript and then take the longest ORF. At least this will validate the splice variation or genome misannotation theory. I can do a simple RT-PCR across the exon 2/3 junction and sequence that product to verify they are spliced as annotated (I can't amplify across exons 1 and 2 by RT-PCR no matter how I try, while I get a clear product from exon 2/3). The 3' end of the gene that spans the exon 2/3 gap contains a well-conserved functional domain, so I'm less worried about that region.

Fortunately I have recently HA-tagged the endogenous locus by CRISPR-Cas9 editing, so I should be able to do a rough validation of protein size/ORF selection via western blot, as you suggest, WouterDeCoster.


What about Edman sequencing? ;-)


Good one! Theoretically I could do it, but what an expensive mess that would be.


You could try to find an institution with which you can collaborate... I'm too young to have used the technology myself, but it might prove useful. But I would definitely start with the western. Having the HA tag is a huge advantage. How did you do that CRISPR-based tagging exactly?


Simply pick a suitable PAM site as close to the STOP codon as possible (unfortunately for me this was about 200 bp away) and then design a repair template to silently recodonise the CDS between the cut site and the STOP with a 3xHA tag stuck on the end of the CDS. You flank your changes (in this case the 200 bp recodonisation and exogenous HA sequence) with 300-500 bp "arms" that are homologous to the sequences flanking the desired integrated change (in this case 500 bp upstream of the cut site and 500 bp downstream of the STOP). Transfect and drug select for parasites containing the plasmid and bingo you've got your tag!

This works well in P. falciparum because, although transfection and CRISPR are horribly inefficient, the parasites don't have canonical non-homologous end joining pathways. Once you've cleaved the chromosome, the only way for the parasite to survive is to use homology-directed repair, using the template you design and supply to fix it.
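
The template layout described above (upstream arm, recodonised CDS, 3xHA, STOP, downstream arm) can be sketched in Python as follows. This is an illustration only, not the actual design used: the function names are made up, the synonymous codon is picked arbitrarily (alphabetically) rather than by P. falciparum codon usage, and a real design would also check that the guide/PAM site is actually disrupted.

```python
# Sketch: assemble a repair template with a silently recodonised CDS + 3xHA.
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

HA_TAG = "TACCCATACGATGTTCCAGATTACGCT"  # encodes the HA epitope YPYDVPDYA

def translate(dna):
    """Translate an in-frame DNA string with the standard genetic code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def recodonise(cds):
    """Swap each codon for a different synonymous codon where one exists."""
    if len(cds) % 3:
        raise ValueError("CDS length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        alternatives = sorted(c for c, aa in CODON_TABLE.items()
                              if aa == CODON_TABLE[codon] and c != codon)
        # Met and Trp have a single codon and stay unchanged.
        out.append(alternatives[0] if alternatives else codon)
    return "".join(out)

def build_repair_template(left_arm, cds_to_stop, right_arm, n_tags=3):
    """Join arm + recodonised CDS + HA tags + stop codon + arm.

    `cds_to_stop` is the in-frame CDS from the cut site up to, but not
    including, the native STOP; `right_arm` starts after the native STOP.
    """
    return left_arm + recodonise(cds_to_stop) + HA_TAG * n_tags + "TAA" + right_arm

original = "ATGAAATGGGAT"
recoded = recodonise(original)
print(recoded, translate(recoded) == translate(original))
```

The recoded stretch keeps the same protein sequence while changing the DNA, which is what stops Cas9 from re-cutting the repaired locus.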


I would suggest ribo-seq, but perhaps that's beyond what's possible given what you described.


Thanks - unfortunately you're probably right. It's a good idea to have in my back pocket if needed, but it's technically beyond what I would like to do as a first approach.
