Question

RNA-seq alignment BAM: extract transcripts and convert to protein

8.3 years ago

bioinformatics.cancer ▴ 260

Hi, I have aligned RNA-seq FASTQ reads from a dog bladder cancer line to the dog genome using STAR (Ensembl reference genome and GTF annotation). Ultimately I need protein sequences for each of the unique transcripts present in the sequenced RNA. My plan was to extract all unique assembled transcripts from the BAM file and then translate each one, producing a FASTA file of protein sequences with the gene name in the header. Is this feasible, and if so, what would be the best approach? Searching around, I found some approaches using samtools mpileup and bcftools, and another using extractTranscriptSeqs() in GenomicFeatures (Bioconductor), but I have not been able to piece together the information from different posts for my specific purpose. I would greatly appreciate any help with determining the feasibility of this task and the best tools/approach. Thanks, - Pankaj

RNA-Seq bam protein sequence transcripts • 3.3k views

score 1 · Answer 1 · 2016-08-04

8.3 years ago

microfuge ★ 1.9k

Presuming that you are interested in protein changes in the case sample, one way would be to use mpileup | bcftools to get a VCF. Then use vcf-consensus to replace the reference nucleotides in the genome FASTA with the alternate alleles. Next, extract the ORF sequences from the modified genome FASTA using bedtools and a BED file of the genome annotation. The ORFs can then be translated by a script to get the altered protein sequences. Alternatively, you could use SnpEff to annotate the VCF; it will report the type of each change and its location in the protein.
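As a rough illustration of the last two steps (substituting alternate alleles and translating the ORF), here is a minimal Python sketch. It assumes the coding sequence has already been extracted on the coding strand and that variant positions are 1-based offsets within that sequence rather than genome coordinates; the function names `apply_snvs` and `translate` are hypothetical, and only single-nucleotide substitutions are handled, not indels.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snvs(seq, snvs):
    """Return seq with single-nucleotide substitutions applied.

    snvs is an iterable of (pos, ref, alt) tuples. Assumption: pos is a
    1-based offset into seq itself, not a genome coordinate.
    """
    bases = list(seq.upper())
    for pos, ref, alt in snvs:
        if bases[pos - 1] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = alt
    return "".join(bases)


def translate(cds):
    """Translate a coding sequence from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks codons with N or other symbols
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy example: a 4-codon ORF with one missense substitution (GCC -> GTC).
ref_cds = "ATGGCCAAGTGA"
alt_cds = apply_snvs(ref_cds, [(5, "C", "T")])
print(translate(ref_cds), translate(alt_cds))
```

For genes on the minus strand the extracted sequence would need to be reverse-complemented before translation, and spliced transcripts need their exons concatenated first; both are left out of this sketch.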

Thanks for the suggestions. Actually, I am interested in changes to the protein, but using a proteomics approach rather than a bioinformatics or genomic approach. The idea is to extract the aligned reads from the BAM file and assemble them into transcript sequences. Then we want to translate each transcript sequence into a protein sequence. The last step is to search a mass spec spectra database for mutations or other genomic alterations. The MS spectra database was made from cell lines derived from cancer cells and therefore represents real mutations. So the idea is to identify mutant transcripts (and therefore the originating genes) from the RNA-seq data using the experimentally derived protein mutations. I am guessing mpileup might still be used for extracting the reads and assembling them into transcript sequences, but I don't have experience with this tool. Thanks.
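The translation step described here could be sketched as follows, assuming the transcript sequences have already been assembled by a separate tool (mpileup reports per-position pileups and does not assemble reads). Because an assembled transcript's strand and reading frame may be unknown, this sketch scans all six frames and keeps the longest ATG-to-stop ORF. The names `longest_orf_protein` and `to_fasta` and the `GENE_A|tx1` header are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_protein(transcript):
    """Return the longest ATG-to-stop protein found in any of the six frames.

    ORFs that run off the end of the sequence without a stop codon are
    ignored, which is a simplification for partially assembled transcripts.
    """
    seq = transcript.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            protein, in_orf = [], False
            for i in range(frame, len(strand) - 2, 3):
                aa = CODON_TABLE.get(strand[i:i + 3], "X")
                if not in_orf:
                    if aa == "M":  # open a new ORF at the first ATG
                        protein, in_orf = ["M"], True
                elif aa == "*":  # close the ORF and keep it if it is the longest
                    if len(protein) > len(best):
                        best = "".join(protein)
                    in_orf = False
                else:
                    protein.append(aa)
    return best


def to_fasta(records, width=60):
    """Format (header, protein) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for header, protein in records:
        lines.append(f">{header}")
        lines.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(lines) + "\n"


# Toy example: one assembled transcript keyed by a made-up gene|transcript label.
transcripts = {"GENE_A|tx1": "CCATGGCCAAGTGAGG"}
records = [(name, longest_orf_protein(seq)) for name, seq in transcripts.items()]
print(to_fasta(records), end="")
```

Taking the longest ORF is only a heuristic; where the annotation already defines the CDS for a transcript, using those coordinates would be more reliable than re-deriving the frame.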

Reply 8.3 years ago by bioinformatics.cancer ▴ 260