Dear Biostars community, I am working on a project in which I have to generate features from an RNA sequence to classify whether it will be translated or not.
I thought about adding ribosome footprint (RFP) and expression levels as features. For these, I built a large Ribo-seq and RNA-seq dataset to cover as many RNA sequences as possible, and I preprocessed it. My RNA sequences are in FASTA format and I have indexed them. Here comes my question:
Should I first align the Ribo-seq and RNA-seq datasets to the reference genome with its annotation (taken from GENCODE), and then align the mapped reads to my indexed RNA sequences? Or is that a waste of time, so that I can align the Ribo-seq and RNA-seq datasets directly to my indexed RNA sequences?
Thank you for your time.
Hi Manu Ayllon,
It is difficult to advise as I am unclear on the goal of your project. I will tell you what I understand and you can correct me where needed.
The overall goal is to develop a classifier that will accurately determine whether a given RNA is translated or not, where "translated" means bound and read by ribosomes but not necessarily encoding a stable protein product.
To develop this classifier, you want to take publicly available Ribo-seq data and the paired RNA-seq data to obtain a set of translated RNAs on which you could start to train your model using sequence features (?).
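If I have understood the plan, the per-transcript feature table could be sketched roughly as below. This is only an illustration under my own assumptions: the function names, the toy counts, the choice of TPM as the normalization, and the `min_rfp_reads` cutoff are all hypothetical and not taken from your post. It assumes you already have raw read counts per transcript from both the Ribo-seq and the RNA-seq alignments.

```python
def tpm(counts, lengths):
    """Convert raw per-transcript read counts to TPM (transcripts per million)."""
    # Length-normalized read rate for every transcript in the reference set.
    rates = {t: counts.get(t, 0) / lengths[t] for t in lengths}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in lengths}
    return {t: rate / total * 1e6 for t, rate in rates.items()}


def build_features(ribo_counts, rna_counts, lengths, min_rfp_reads=10):
    """Build one feature row per transcript.

    min_rfp_reads is an arbitrary, hypothetical cutoff used only to
    illustrate how a 'translated' label might be derived.
    """
    ribo_tpm = tpm(ribo_counts, lengths)
    rna_tpm = tpm(rna_counts, lengths)
    features = {}
    for t in lengths:
        features[t] = {
            "rfp_tpm": ribo_tpm[t],  # ribosome footprint level
            "rna_tpm": rna_tpm[t],   # expression level
            "translated": ribo_counts.get(t, 0) >= min_rfp_reads,
        }
    return features


# Toy input: transcript lengths (nt) and raw read counts (made-up numbers).
lengths = {"tx1": 1000, "tx2": 2000, "tx3": 500}
ribo_counts = {"tx1": 50, "tx2": 0, "tx3": 5}
rna_counts = {"tx1": 100, "tx2": 200, "tx3": 50}

features = build_features(ribo_counts, rna_counts, lengths)
for name, row in features.items():
    print(name, row)
```

Whether a simple read-count cutoff is a sensible label depends on what you count as "translated", which is part of what I am asking about.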
Are you just looking at human data? I cannot tell from your blastn command. What kinds of RNAs are you investigating? I am unclear about the role of blastn and why you wouldn't just use a reference annotation. I am happy to help, just need a bit more info!