Extracting fasta sequence in R
1
0
Entering edit mode
2.3 years ago
margo ▴ 40

I am wanting to extract the sequences in a fasta file as specific positions. I have downloaded my fasta file (f) in r by using the seqinr package.

f <- read.fasta("sequence.fasta")

I have a data frame (df1) that contains the positions I want to extract sequences at. My dataframe looks like this:

V1    strand 
6002  -
5679  - 
10123 -
7685  - 
13563 -
15588 -
16064 -
21042 -
..

What I am wanting to get is the sequence 50 back from this position and 5 ahead and output it to my data frame. Df1 contains the negative strand and I also have a Df2 which is the positive strand. I am wanting the extract these sequences from the fasta file for both strands. Does anyone know how I can do this?

Desired output would look like:

V1    strand sequence
6002  -      agggacc...
5679  -      ctttgac...
10123 -
7685  - 
13563 -
15588 -
16064 -
21042 -
..
dataframe R • 4.5k views
ADD COMMENT
0
Entering edit mode

Look into the stringr package https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf

I think you could use str_sub with your coordinates (n-5 to n+50 or n-50 to n+5, I can't tell from your post).

ADD REPLY
0
Entering edit mode

Is "n" my V1 column?

ADD REPLY
1
Entering edit mode
2.3 years ago

Example FASTA.

example.fasta

>seq1
GGGGGTCGAGACCCCAGCCACAAAGGTAACGCACATGTGAACGGGCCCCG

Example ranges.

ranges <- data.frame(V1=c(10, 25), strand=rep("-", 2))

With sequences in R you usually want to work with them as a Biostrings object.

library("Biostrings")

fasta <- readDNAStringSet("example.fasta")

> fasta
DNAStringSet object of length 1:
    width seq                                               names               
[1]    50 GGGGGTCGAGACCCCAGCCACAA...AACGCACATGTGAACGGGCCCCG seq1

Ranges in R should normally be stored as GRanges objects.

library("GenomicRanges")

ranges$seqnames <- "seq1"
ranges <- makeGRangesFromDataFrame(ranges, start.field="V1", end.field="V1")

> ranges
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     seq1        10      -
  [2]     seq1        25      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

You can then easily manipulate the ranges to the desired widths. Here I'll extend it 5 bases downstream and 10 bases upstream.

start(ranges) <- start(ranges) - 5
end(ranges) <- end(ranges) + 10

> ranges
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     seq1      5-20      -
  [2]     seq1     20-35      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Extracting the sequences are now simple.

library("BSgenome")

> getSeq(fasta, ranges)
DNAStringSet object of length 2:
    width seq
[1]    16 TGGCTGGGGTCTCGAC
[2]    16 TGTGCGTTACCTTTGT
ADD COMMENT

Login before adding your answer.

Traffic: 2351 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6