Question

R functions that extract the ORF from a sequence

9.2 years ago

peter.durr • 0

Hi everyone

I am working within R and need to extract the open reading frame (ORF) from a number of viral sequences.

Somewhat to my surprise, I have not yet come across R functions in any package that find ORFs and readily extract them.

Can anyone point me to R functions that will do these tasks?

Thanks

sequence • 9.4k views

updated 2.2 years ago by Ram 44k • written 9.2 years ago by peter.durr • 0

Why in R? There are many other straightforward solutions available (bedtools, EMBOSS, etc.).

9.2 years ago by Israel Barrantes ▴ 790

Yes, you are certainly correct: for sequence manipulation there are better tools.

I am trying to do things in R because

of the downstream tools - especially for phylogenetics
I can code the total workflow into one replicable file

But in this case maybe R is not yet mature enough, and I will need to do the sequence manipulations outside of R and then work on a clean alignment for the analysis.

updated 5.0 years ago by Ram 44k • written 9.2 years ago by peter.durr • 0

Are you attempting de novo prediction of all ORFs, or do you want to extract only the ORFs from known/annotated viruses?

updated 5.0 years ago by Ram 44k • written 9.2 years ago by Joseph Pearson ▴ 480

I am extracting from known viruses - actually segments of influenza viruses

The challenge arises when I download a lot of them from GenBank: the segments will be of variable lengths.

the five starting scenarios are:

complete segment length (about 1741 nt for segment 4)
complete coding sequence (about 1704 nt for segment 4)
missing regions to the left - with no start codon
missing regions to the right - with no stop codon
missing left and right - no start or stop codon

I am hoping to develop a workflow that can classify the sequences into the five groups, and I was hoping I could build on existing code.

thanks

updated 5.0 years ago by Ram 44k • written 9.2 years ago by peter.durr • 0
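
The five scenarios in the post above differ mainly in whether a start codon and a stop codon are present, so a rough classifier can be sketched in base R. This is only a heuristic under stated assumptions, not a validated method: sequences are taken to be in coding sense, the standard stop codons apply, the reading frame is guessed as the one with the longest stop-free run, and a start codon counts as present only if an in-frame ATG begins within a hypothetical `max_leader_nt` window from the 5' end. The function names and the default of 60 are illustrative choices, not taken from any package.

```r
stop_codons <- c("TAA", "TAG", "TGA")  # standard genetic code assumed

# Describe the longest stop-free run of codons in one forward frame (0, 1, 2).
open_run <- function(dna, frame) {
  n_codons <- (nchar(dna) - frame) %/% 3
  if (n_codons < 1) return(NULL)
  starts <- frame + 3 * (seq_len(n_codons) - 1) + 1
  codons <- substring(dna, starts, starts + 2)
  runs <- rle(!(codons %in% stop_codons))
  ends <- cumsum(runs$lengths)
  begins <- ends - runs$lengths + 1
  open <- which(runs$values)
  if (length(open) == 0) return(NULL)
  k <- open[which.max(runs$lengths[open])]
  idx <- begins[k]:ends[k]
  atg <- idx[codons[idx] == "ATG"]
  list(n = as.numeric(runs$lengths[k]),
       first_atg = if (length(atg)) starts[atg[1]] else NA,
       has_stop = ends[k] < n_codons,   # run is followed by a stop codon
       stop_end = starts[ends[k]] + 5)  # last base of that stop (if any)
}

# Heuristic classification into the five scenarios (labels are hypothetical).
classify_segment <- function(dna, max_leader_nt = 60) {
  dna <- toupper(dna)
  runs <- Filter(Negate(is.null), lapply(0:2, function(f) open_run(dna, f)))
  if (length(runs) == 0) return("unclassified")
  best <- runs[[which.max(vapply(runs, function(r) r$n, numeric(1)))]]
  has_start <- !is.na(best$first_atg) && best$first_atg <= max_leader_nt
  if (has_start && best$has_stop) {
    exact <- best$first_atg == 1 && best$stop_end == nchar(dna)
    if (exact) "complete_cds" else "complete_segment"
  } else if (best$has_stop) {
    "missing_left"
  } else if (has_start) {
    "missing_right"
  } else {
    "missing_both"
  }
}

# Toy sequences only; real segments are far longer.
classify_segment("ATGCTAACCGGTGACCCCTAA")      # start and stop at the ends
classify_segment("GGATGCTAACCGGTGACCCCTAACC")  # same CDS with flanking bases
```

A 5'-truncated record that happens to carry an in-frame ATG inside the leader window would be misclassified, so comparing the ORF length with the expected coding length (about 1704 nt for segment 4) would be a sensible extra check.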

This might be a good starting point:

http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter7.html

The seqinr package has a number of functions that deal with the prediction of reading frames:

https://cran.r-project.org/web/packages/seqinr/seqinr.pdf

updated 5.0 years ago by Ram 44k • written 9.1 years ago by Joseph Hughes ★ 3.0k
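
To make the idea of scanning reading frames concrete, here is a minimal base-R sketch that finds the longest ATG-to-stop ORF on the forward strand. It deliberately avoids package calls, so it is not the seqinr API or the code from the linked chapter; `split_codons` and `find_longest_orf` are hypothetical helper names, and the standard stop codons are assumed.

```r
# Stop codons of the standard genetic code (assumed appropriate here).
stop_codons <- c("TAA", "TAG", "TGA")

# Split a DNA string into codons for one forward frame (0, 1 or 2).
split_codons <- function(dna, frame) {
  n_codons <- (nchar(dna) - frame) %/% 3
  if (n_codons < 1) return(character(0))
  starts <- frame + 3 * (seq_len(n_codons) - 1) + 1
  substring(dna, starts, starts + 2)
}

# Return the longest ATG...stop ORF on the forward strand, or NULL if none.
find_longest_orf <- function(dna) {
  dna <- toupper(dna)
  best <- NULL
  for (frame in 0:2) {
    codons <- split_codons(dna, frame)
    atg_idx <- which(codons == "ATG")
    stop_idx <- which(codons %in% stop_codons)
    for (a in atg_idx) {
      s <- stop_idx[stop_idx > a]
      if (length(s) == 0) next          # no in-frame stop downstream
      start <- frame + 3 * (a - 1) + 1  # 1-based position of the A in ATG
      end <- frame + 3 * s[1]           # last base of the stop codon
      if (is.null(best) || (end - start) > (best$end - best$start)) {
        best <- list(start = start, end = end, frame = frame + 1,
                     sequence = substr(dna, start, end))
      }
    }
  }
  best
}

orf <- find_longest_orf("ccATGAAATTTGGGTAAcc")
orf$sequence  # "ATGAAATTTGGGTAA", positions 3 to 17, frame 3
```

The sketch ignores the reverse strand and partial ORFs that run off either end of the sequence, both of which matter for truncated GenBank records.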

score 4 · Accepted Answer · 2019-07-17

5.4 years ago

hauken_heyken ▴ 130

The R package ORFik in Bioconductor has all you need: it is implemented in C++ and even handles circular genomes.
