I'm hunting for a member of a protein family in an unannotated genome. With a bit of luck and the help of transcriptome data from a distantly related species, I found that the gene seems to be split across three contigs because of TA (or AT?) repeats:
contig 1-(TA)*-small contig 2-(TA)*-contig 3
How would you predict the gene in this case: Run separate predictions for the three contigs? Glue the contigs together manually?
(Also, I'd be happy for any pointer on the function of the TA repeat.)
Just curious - how do you know there are (AT)n between the contigs? ... I would start by predicting partial genes on each contig. The contig merging can be justified if you have some kind of linking information - like paired reads mapped to adjacent contigs. Transcript from a distant relative sounds a bit risky as evidence.
contig 1 has AT repeats at the end, contig 2 at both ends and contig 3 at the start. Of course I don't know if they're really connected, but given that all contigs harbor a fragment of the same gene it seems likely.
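A quick way to check this observation across all three contigs is to measure the alternating A/T run that touches each contig end. The following is only a minimal sketch: the function names and the 10-base threshold are assumptions for illustration, not something from this thread. Because (TA)n and (AT)n are the same repeat read in a different phase, the code looks for any strictly alternating A/T run rather than a fixed unit.

```python
def alternating_at_run(seq, from_end):
    """Length (in bases) of the alternating A/T run touching one contig end.

    from_end=True inspects the 3' end, from_end=False the 5' end.
    (TA)n and (AT)n are treated alike: any strictly alternating A/T run counts.
    """
    s = seq.upper()
    if from_end:
        s = s[::-1]
    n = 0
    for i, base in enumerate(s):
        if base not in "AT":
            break
        if i > 0 and base == s[i - 1]:
            break  # two identical bases in a row end the alternating run
        n += 1
    return n


def has_repeat_end(seq, from_end, min_len=10):
    """True if the contig end carries a run of at least min_len bases.

    min_len=10 is an arbitrary, assumed threshold; tune it for your data.
    """
    return alternating_at_run(seq, from_end) >= min_len
```

For the layout described here, you would expect `has_repeat_end(contig1, True)`, both ends of contig 2, and `has_repeat_end(contig3, False)` to be true, possibly after reverse-complementing a contig that was assembled in the opposite orientation.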
I would join the contigs if a couple important criteria are met:
1) Each contig must align to a unique portion of the distant relative, with no overlap in the residue positions covered on that relative. You don't want contig A to match amino acids 15 to 97 and contig B to match amino acids 85 to 188, as this indicates that those two contigs should be joined during assembly of the genome and not manually for the sake of this gene hunting/modeling.
2) Each contig should have roughly the same percent identity and percent similarity. "Roughly" is key here, and it is hard to define what an acceptable range is. You do not want to be dealing with paralogs - 2 genes - when you're assuming a single gene. In other words, don't manually create a gene fusion.
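The two criteria above can be sketched as a small check over the alignment results. This is a hedged illustration only: the `hits` record layout, the function name, and both thresholds are assumptions (the acceptable identity range in particular is a judgment call), and you would fill `hits` from whatever aligner you used.

```python
def check_join_criteria(hits, max_overlap=0, max_identity_spread=10.0):
    """Return a list of reasons NOT to join the contigs (empty list = no objection).

    hits: list of dicts with keys
        contig        contig name
        ref_start     first aligned residue on the reference protein (1-based)
        ref_end       last aligned residue on the reference protein (inclusive)
        pct_identity  percent identity of the alignment
    Both thresholds are assumed defaults, not established cutoffs.
    """
    problems = []

    # Criterion 1: neighbouring hits must not cover the same reference residues.
    ordered = sorted(hits, key=lambda h: h["ref_start"])
    for a, b in zip(ordered, ordered[1:]):
        overlap = a["ref_end"] - b["ref_start"] + 1
        if overlap > max_overlap:
            problems.append(
                f"{a['contig']} and {b['contig']} overlap by {overlap} residues"
            )

    # Criterion 2: percent identity should be similar across contigs.
    identities = [h["pct_identity"] for h in hits]
    spread = max(identities) - min(identities)
    if spread > max_identity_spread:
        problems.append(f"identity spread of {spread:.1f} points (possible paralogs)")

    return problems


# The overlapping case from criterion 1: residues 15 to 97 versus 85 to 188.
example = [
    {"contig": "A", "ref_start": 15, "ref_end": 97, "pct_identity": 62.0},
    {"contig": "B", "ref_start": 85, "ref_end": 188, "pct_identity": 60.5},
]
print(check_join_criteria(example))
```

An empty result only means these two checks raised no objection; it is not proof that the contigs are adjacent in the genome.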
I would also run the translations of the contigs through a motif/domain finder (e.g. Pfam) to help identify what, if anything, is missing from the protein-coding portion of your gene model.
As to the biological function of the TA repeats: they could be transposon insertion sites or remnants, or they could be structural for the DNA itself. They could also be binding sites for DNA modification enzymes or transcription factors, but that is unlikely. Genetically, they can be used as markers.
I know that this is a bioinformatics forum and I do love bioinformatics solutions, but this might be a case for the wet lab. Just see if you or someone in your lab can PCR the gaps to verify that the contigs should be glued together.
If a PCR product forms between the contigs, then the predicted contig order is correct and you can glue the contigs together.
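If you do end up gluing the contigs, one simple option is to join them with a run of Ns so the gap stays visible to downstream tools instead of pretending the sequence is contiguous. The sketch below is an assumption-laden illustration: the function names are made up here, and the default of 100 Ns is just a commonly used placeholder for a gap of unknown length, not a measured size.

```python
# Translation table for complementing DNA (upper and lower case, N kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_contigs(contigs, gap_len=100):
    """Join contigs in the given order, separated by gap_len Ns.

    contigs: list of (sequence, flip) pairs; flip=True means the contig
    is reverse-complemented before joining.
    gap_len=100 is an assumed placeholder for an unknown gap size.
    """
    parts = [reverse_complement(s) if flip else s for s, flip in contigs]
    return ("N" * gap_len).join(parts)
```

Once the gaps have been sequenced (for example from the PCR products), the N runs can be replaced with the real sequence before the final gene prediction.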
Sure, if I had a specimen of the organism, this would be an option... ;-) But you're right, I was thinking about contacting a lab that works with the organism to see if we could culture it as well.
This is what a sequencing center would call "finishing." Even if it is a wet lab solution, you still need bioinformatics to propose primers and analyze any joins made. You can use bioinformatics to predict if this gene is in a family of one member or more by looking at the same gene in other similar organisms.