Hi, I have aligned RNA-seq reads (FASTQ) from a dog bladder cancer cell line to the dog genome using STAR, with the Ensembl reference genome and annotation (GTF). Ultimately I need to get a protein sequence for each of the unique transcripts that were present in the sequenced RNA. The approach I thought of taking was to extract all unique assembled transcripts from the BAM file and then translate each of them to produce a FASTA file of protein sequences with the gene name in the header. Is this feasible, and if so, what approach would be best? Searching around, I found some approaches using samtools mpileup and bcftools, and another using extractTranscriptSeqs() in GenomicFeatures (Bioconductor), but I have not been able to put together the pieces of information from the different posts for my specific purpose. I would greatly appreciate any help with determining the feasibility of this task and the best tools/approach. Thanks, - Pankaj
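To make the translation step concrete, here is a minimal Python sketch of what is being asked for. It assumes the transcript nucleotide sequences have already been obtained by some other tool, and it picks the longest ATG-initiated open reading frame on the forward strand, which is only one possible way to choose a coding frame. The function names and the `TX1`/`GENE1` identifiers are hypothetical placeholders, not output from STAR or Ensembl.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a nucleotide sequence in frame 0; unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf_protein(seq):
    """Return the longest M-initiated peptide over the three forward frames.

    Assumption: the transcript is given in its sense orientation. An ORF that
    runs off the end of the sequence without a stop codon is still accepted.
    """
    best = ""
    for frame in range(3):
        for segment in translate(seq[frame:]).split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


def fasta_record(transcript_id, gene_name, protein, width=60):
    """Format one protein as a FASTA record with the gene name in the header."""
    lines = [f">{transcript_id} gene={gene_name}"]
    lines += [protein[i:i + width] for i in range(0, len(protein), width)]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical toy transcript, for illustration only.
    protein = longest_orf_protein("CCATGGCCAAATAGGG")
    print(fasta_record("TX1", "GENE1", protein))
```

For annotated transcripts the CDS coordinates in the GTF would be a more reliable way to choose the reading frame than a longest-ORF search.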
Thanks for the suggestions. Actually, I am interested in changes to the protein, but using a proteomics approach rather than a bioinformatics or genomic approach. The idea is to extract the aligned reads from the BAM file and assemble them into transcript sequences, then translate each transcript sequence into a protein sequence. The last step is to search a mass spectrometry (MS) spectra database that we have for mutations or other genomic alterations. The MS spectra database was made from cell lines derived from cancer cells and therefore represents real mutations. So the idea is to identify mutant transcripts (and therefore their originating genes) from the RNA-seq data using the experimentally derived protein mutations. I am guessing mpileup might still be used for extracting the reads and assembling them into transcript sequences, but I don't have experience with this tool. Thanks.
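As a rough illustration of the final comparison, the sketch below lists amino-acid substitutions between a reference protein and the protein translated from a sample transcript. It is a simplification under stated assumptions: both sequences must have the same length, so insertions, deletions, and frameshifts are not handled, and the function name `substitutions` is a hypothetical helper rather than part of any proteomics tool.

```python
def substitutions(reference, observed):
    """List point substitutions as strings such as 'K3Q' (1-based positions).

    Assumption: the two proteins are already aligned residue by residue,
    which only holds when they have equal length and differ by substitutions.
    """
    if len(reference) != len(observed):
        raise ValueError("sequences differ in length; an alignment is needed")
    return [
        f"{ref_aa}{pos}{obs_aa}"
        for pos, (ref_aa, obs_aa) in enumerate(zip(reference, observed), start=1)
        if ref_aa != obs_aa
    ]


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(substitutions("MAKT", "MAQT"))
```

Each reported substitution could then be used to build the mutant peptide to look for in the spectra search.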