Question

Extracting fasta sequence in R

0

Entering edit mode

3.0 years ago

margo ▴ 40

I am wanting to extract the sequences in a fasta file as specific positions. I have downloaded my fasta file (f) in r by using the seqinr package.

f <- read.fasta("sequence.fasta")

I have a data frame (df1) that contains the positions I want to extract sequences at. My dataframe looks like this:

V1    strand 
6002  -
5679  - 
10123 -
7685  - 
13563 -
15588 -
16064 -
21042 -
..

What I am wanting to get is the sequence 50 back from this position and 5 ahead and output it to my data frame. Df1 contains the negative strand and I also have a Df2 which is the positive strand. I am wanting the extract these sequences from the fasta file for both strands. Does anyone know how I can do this?

Desired output would look like:

V1    strand sequence
6002  -      agggacc...
5679  -      ctttgac...
10123 -
7685  - 
13563 -
15588 -
16064 -
21042 -
..

dataframe R • 5.7k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 3.0 years ago by margo ▴ 40

0

Entering edit mode

Look into the stringr package https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf

I think you could use str_sub with your coordinates (n-5 to n+50 or n-50 to n+5, I can't tell from your post).

ADD REPLY • link 3.0 years ago by Trivas ★ 1.9k

0

Entering edit mode

Is "n" my V1 column?

ADD REPLY • link 3.0 years ago by margo ▴ 40

score 1 · Answer 1 · 2022-08-03

Example FASTA.

example.fasta

>seq1
GGGGGTCGAGACCCCAGCCACAAAGGTAACGCACATGTGAACGGGCCCCG

Example ranges.

ranges <- data.frame(V1=c(10, 25), strand=rep("-", 2))

With sequences in R you usually want to work with them as a Biostrings object.

library("Biostrings")

fasta <- readDNAStringSet("example.fasta")

> fasta
DNAStringSet object of length 1:
    width seq                                               names               
[1]    50 GGGGGTCGAGACCCCAGCCACAA...AACGCACATGTGAACGGGCCCCG seq1

Ranges in R should normally be stored as GRanges objects.

library("GenomicRanges")

ranges$seqnames <- "seq1"
ranges <- makeGRangesFromDataFrame(ranges, start.field="V1", end.field="V1")

> ranges
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     seq1        10      -
  [2]     seq1        25      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

You can then easily manipulate the ranges to the desired widths. Here I'll extend it 5 bases downstream and 10 bases upstream.

start(ranges) <- start(ranges) - 5
end(ranges) <- end(ranges) + 10

> ranges
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     seq1      5-20      -
  [2]     seq1     20-35      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Extracting the sequences are now simple.

library("BSgenome")

> getSeq(fasta, ranges)
DNAStringSet object of length 2:
    width seq
[1]    16 TGGCTGGGGTCTCGAC
[2]    16 TGTGCGTTACCTTTGT