Hi,
I am trying to obtain genomic coordinates for a list of novel coding transcripts. These novel transcripts were obtained by aligning in-house RNA-Seq data to the Ensembl human reference genome and assembling the reads against the Ensembl annotation. From the assembled GTF file, the transcript sequences were extracted into a FASTA file with gffread.
The fasta file of transcripts was then translated in 3 frames and the resulting sequences were split at stop codons to generate a database of all possible proteins. This database was then used in tandem with mass spec files to identify novel coding transcripts.
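For readers following along, the three-frame translation and stop-codon splitting described above could look roughly like the sketch below. It is not the pipeline actually used here; the function name `three_frame_orfs`, the `min_len` parameter and the choice of 0-based, half-open transcript coordinates are assumptions made for illustration. Keeping the frame and the nucleotide offsets next to each peptide is what later allows a peptide to be placed back on its transcript.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def three_frame_orfs(seq, min_len=1):
    """Yield (frame, tx_start, tx_end, peptide) for every stop-free segment.

    tx_start/tx_end are 0-based, half-open nucleotide positions on the
    transcript (an assumption of this sketch, not a fixed convention).
    """
    seq = seq.upper()
    for frame in range(3):
        residues = []
        seg_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            # Codons containing N or other ambiguity codes become "X".
            residue = CODON_TABLE.get(seq[pos:pos + 3], "X")
            if residue == "*":
                if len(residues) >= min_len:
                    yield frame, seg_start, pos, "".join(residues)
                residues = []
                seg_start = pos + 3  # next segment starts after the stop
            else:
                residues.append(residue)
            pos += 3
        if len(residues) >= min_len:
            yield frame, seg_start, pos, "".join(residues)


if __name__ == "__main__":
    for record in three_frame_orfs("ATGGCCTAAGGG"):
        print(record)
```

A peptide found inside one of these segments at amino-acid offset `k` then starts at transcript position `tx_start + 3 * k`.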
Having identified these sequences, I would like to map the proteins back to the genome to obtain their genomic coordinates. I have been able to do this by getting the protein's coordinates relative to the transcript and then converting them to genomic coordinates with ensembldb in R. However, some sequences have regions that lie outside annotated regions, and ensembldb fails to fetch the genomic coordinates for these.
I am wondering if anyone has a good solution for this?
Maybe I am missing the point, but you cannot map spliced sequences back to the genome, can you? You say that you have an assembled GTF file; that is in genomic space, isn't it? So it should already contain the genomic coordinates of the TSS and of the transcript end, no?
Edit: I think I misread the question, and I am not fully sure what OP needs, so I am withdrawing my comment here. Sorry for the mess.
Hi, chiming in here. I am not sure either what is meant. If the proteins were identified by proteomics, using as reference the novel transcripts translated from the annotation in a GTF file, then the genomic transcript coordinates can be obtained from that file by matching the identifiers in the mass-spec results against the GTF annotation.
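To make the identifier matching concrete, here is a minimal sketch that collects exon records per `transcript_id` from GTF lines. It assumes a standard 9-column, tab-separated GTF with `exon` features and a quoted `transcript_id` attribute; the function name `exons_by_transcript` and the IDs `TX1`/`TX2` in the example are made up for illustration.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exons_by_transcript(gtf_lines):
    """Map transcript_id -> list of (chrom, start, end, strand) exon tuples.

    Coordinates are kept as in the GTF: 1-based, inclusive.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features define the spliced structure
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )
    # Sort each transcript's exons by genomic start position.
    return {tx: sorted(ex, key=lambda e: e[1]) for tx, ex in exons.items()}


if __name__ == "__main__":
    # Hypothetical records; in practice, iterate over the open GTF file.
    example = [
        '1\tassembler\ttranscript\t101\t210\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";',
        '1\tassembler\texon\t201\t210\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";',
        '1\tassembler\texon\t101\t110\t.\t+\t.\tgene_id "G1"; transcript_id "TX1";',
        '2\tassembler\texon\t500\t560\t.\t-\t.\tgene_id "G2"; transcript_id "TX2";',
    ]
    print(exons_by_transcript(example)["TX1"])
```

Because the exon records come from the assembled GTF itself, this also covers transcripts that extend beyond the reference annotation.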
One more thing: why does everyone keep claiming that it is impossible to map a spliced transcript, CDS or protein back to the genome? It is in fact very easy with the right tools, e.g. exonerate or GMAP. See my recipe from yesterday's post: Find a gene of interest in a species genome
This might in fact apply here, too.
What I wanted to say (but said poorly) is that you will not get a fixed 1:1 mapping: for a spliced transcript of, say, 1000 bp, you will not get a single 1000 bp genomic interval unless it is an exon-only (intronless) gene. Of course, with proper tools that take splicing into account you can get the original genomic stretch back, incl. the introns, but as said I do not see how this would then be different from the assembled GTF.
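To illustrate the point about splicing, a transcript-relative interval can be projected onto the genome exon by exon, which generally yields several genomic blocks rather than one interval. The sketch below assumes exons given as 1-based inclusive (start, end) pairs as in a GTF, and a 0-based half-open interval on the transcript; the function name `transcript_to_genome` is hypothetical, and this is not what ensembldb does internally.

```python
def transcript_to_genome(exons, strand, tx_start, tx_end):
    """Project a transcript interval onto the genome.

    exons: list of (start, end), 1-based inclusive genomic coordinates.
    strand: "+" or "-".
    tx_start, tx_end: 0-based, half-open interval along the spliced
    transcript, counted from its 5' end.
    Returns genomic blocks (start, end), 1-based inclusive, sorted by start.
    """
    # Walk exons in transcript order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    blocks = []
    offset = 0  # transcript position at which the current exon begins
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start + 1
        lo = max(tx_start, offset)
        hi = min(tx_end, offset + ex_len)
        if lo < hi:  # the interval overlaps this exon
            if strand == "+":
                g_start = ex_start + (lo - offset)
                g_end = ex_start + (hi - offset) - 1
            else:
                g_end = ex_end - (lo - offset)
                g_start = ex_end - (hi - offset) + 1
            blocks.append((g_start, g_end))
        offset += ex_len
    return sorted(blocks)


if __name__ == "__main__":
    exons = [(101, 110), (201, 210)]
    # A 10-nt stretch that crosses the exon junction gives two blocks.
    print(transcript_to_genome(exons, "+", 5, 15))
```

The block lengths add up to the length of the transcript interval, while the genomic span from the first block's start to the last block's end also includes the intron.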